Some arm cortex-a8 improvements
Torbjorn Granlund
tg at gmplib.org
Mon Apr 23 16:49:55 CEST 2012
Richard Henderson <rth at twiddle.net> writes:
Indeed I know that the hw registers that allow such recognition
are all privileged. For Linux the best one can do is /proc/cpuinfo
or (to some extent) the values in AT_HWCAP.
Something portable would be nice...
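
For illustration, the /proc/cpuinfo route could look roughly like the C
sketch below. It is only a sketch under stated assumptions: the function
names are hypothetical, it assumes the ARM Linux "CPU part" line format,
and it relies on 0xc08 being the part number Cortex-A8 reports. A careful
version would also check the "CPU implementer" field.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the value of the first "CPU part" field
   found in cpuinfo-style text, or -1 if there is none. */
long cpu_part_from_text(const char *text)
{
    const char *p = strstr(text, "CPU part");
    char *end;
    long v;

    if (p == NULL)
        return -1;
    p += strlen("CPU part");
    while (*p == ' ' || *p == '\t')   /* padding before the colon */
        p++;
    if (*p != ':')
        return -1;
    p++;
    v = strtol(p, &end, 0);           /* base 0 accepts the 0x prefix */
    return end == p ? -1 : v;
}

/* Hypothetical helper: same, but reading /proc/cpuinfo.  Returns -1 if
   the file cannot be read or has no "CPU part" line (e.g. on non-ARM). */
long cpu_part_from_proc(void)
{
    char buf[8192];
    size_t len;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    len = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[len] = '\0';
    return cpu_part_from_text(buf);
}

/* Assumption: Cortex-A8 identifies itself with part number 0xc08. */
int looks_like_cortex_a8(long part)
{
    return part == 0xc08;
}
```

This is Linux-specific, so it does not address the portability concern.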
FYI, I dug out the add/mul_2.asm files I was working on in February.
IIRC, they're correct as in they pass the testsuite, but I could not
show them to be faster than the add/mul_1 paths.
Do you know the repeat rate of umull, umlal, umaal, assuming no reg
dependencies?
Usually, it is possible to come close to the mul throughput for some
addmul_N, N >= 1.
Forget mul_2, go for addmul_2, since the latter will be used repeatedly
from mul_basecase or sqr_basecase.
One will want to do latency scheduling for umaal, handling v0 (the low
limb of the 2-limb v operand) and v1 (its high limb) semi-separately; one
first multiplies with v0, then limping along some cycles later,
multiply-adds v1. I suppose umaal + umaal or umaal + umlal would both
work.
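
To make the data flow concrete, here is a rough C model of the
umaal + umaal variant. It is a sketch, not tuned code: umaal() merely
models the instruction (hi:lo = a*b + lo + hi), addmul_2_sketch is a
hypothetical name, and the calling convention (read n limbs at rp, write
n+1, return the top limb) is an assumption of this sketch. The v1 product
trails the v0 product by one limb here; real assembly would
software-pipeline the two chains further apart to hide the multiply
latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of ARM UMAAL: (hi:lo) = a * b + lo + hi.
   The sum is at most 2^64 - 1, so it never overflows 64 bits. */
static void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical sketch: {rp,n} += {up,n} * {vp,2}.
   Writes n+1 limbs at rp and returns the most significant limb. */
uint32_t addmul_2_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                         const uint32_t *vp)
{
    uint32_t v0 = vp[0], v1 = vp[1];
    uint32_t c0 = 0, c1 = 0;   /* separate carry chains for v0 and v1 */
    uint32_t lo;
    size_t i;

    assert(n >= 1);
    for (i = 0; i < n; i++) {
        lo = rp[i];
        umaal(&lo, &c0, up[i], v0);          /* first the v0 product */
        if (i > 0)
            umaal(&lo, &c1, up[i - 1], v1);  /* then v1, one limb behind */
        rp[i] = lo;
    }
    /* The final v1 product absorbs both outstanding carries at once. */
    lo = c0;
    umaal(&lo, &c1, up[n - 1], v1);
    rp[n] = lo;
    return c1;
}
```

Because umaal folds two addends into the product without overflow, neither
carry chain needs a separate add-with-carry instruction.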
--
Torbjörn