Sandybridge addmul_N challenge
Torbjorn Granlund
tg at gmplib.org
Thu Feb 23 07:51:26 CET 2012
nisse at lysator.liu.se (Niels Möller) writes:
  The best I find is

        mov     (up, n, 8), %rax
        mul     v
        mov     %rdx, c1
        add     (rp, n, 8), %rax
        adc     $0, c1
        add     %rax, c0
        adc     $0, c1
        mov     c0, (rp, n, 8)

  Unrolled four times, that's 34 instructions. The best result from the
  loop mixer so far has been 3.24 cycles / limb. See
  shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms. Which is the same as
  your code, I guess.
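
For readers following the carry chain, here is a rough C model of what one pass of the quoted loop body computes. It is only a sketch under assumptions: 64-bit limbs, the GCC/Clang `unsigned __int128` extension, and a hypothetical name `addmul_1_model`; it says nothing about scheduling or cycle counts. The rotation of c1 into c0 stands in for the register renaming between unrolled copies.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;

/* Hypothetical model: {rp,n} += {up,n} * v, returning the carry-out limb. */
static limb_t addmul_1_model(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t c0 = 0;                                 /* carry-in limb */
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;  /* mul v */
        limb_t lo = (limb_t)p;                     /* low half, %rax */
        limb_t c1 = (limb_t)(p >> 64);             /* mov %rdx, c1 */
        lo += rp[i];                               /* add (rp, n, 8), %rax */
        c1 += lo < rp[i];                          /* adc $0, c1 */
        c0 += lo;                                  /* add %rax, c0 */
        c1 += c0 < lo;                             /* adc $0, c1 */
        rp[i] = c0;                                /* mov c0, (rp, n, 8) */
        c0 = c1;                                   /* c1 becomes the next c0 */
    }
    return c0;
}
```

Neither increment of c1 can overflow, since up[i]*v + rp[i] + c0 is at most 2^128 - 1 and so always fits in two limbs.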
My code measured 3.16 cycles/limb in the loop mixer, but runs at 3.25
outside of it. If you can make your smaller code actually run at 3.25,
that is an improvement.
I think we should focus not on addmul_1 but on mul_basecase,
sqr_basecase, redc_1, perhaps redc_2. I.e., please focus on addmul_2
(or addmul_N, N > 2) or vertical multiplication primitives.
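
To illustrate what an addmul_2-style primitive buys, here is a rough C model, not the GMP routine: each limb of up is loaded once and multiplied by two multiplier limbs, so one pass does the work of two addmul_1 passes. The name `addmul_2_model`, the parameters v0 and v1, and the two-limb carry output `cy` are assumptions made for this sketch and may not match the interface GMP actually uses. It again assumes 64-bit limbs and `unsigned __int128`.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;
typedef unsigned __int128 dlimb_t;

/* Hypothetical model: {rp,n} += {up,n} * (v0 + v1 * 2^64).
   The two carry-out limbs are written to cy[0] (low) and cy[1] (high). */
static void addmul_2_model(limb_t *rp, const limb_t *up, size_t n,
                           limb_t v0, limb_t v1, limb_t cy[2])
{
    limb_t a0 = 0, a1 = 0;              /* two-limb running carry */
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];               /* loaded once, used twice */
        dlimb_t p0 = (dlimb_t)u * v0;
        dlimb_t p1 = (dlimb_t)u * v1;

        /* Column i: low carry limb + rp[i] + low half of u*v0. */
        dlimb_t t = (dlimb_t)a0 + rp[i] + (limb_t)p0;
        rp[i] = (limb_t)t;

        /* Column i+1: carry + high carry limb + high(u*v0) + low(u*v1). */
        t = (t >> 64) + a1 + (limb_t)(p0 >> 64) + (limb_t)p1;
        a0 = (limb_t)t;

        /* Column i+2: high(u*v1) + carry; the running carry stays
           below 2^128, so this sum fits in one limb. */
        a1 = (limb_t)(p1 >> 64) + (limb_t)(t >> 64);
    }
    cy[0] = a0;
    cy[1] = a1;
}
```

The model only shows the arithmetic. The point on real hardware would be fewer loads and stores of rp per multiply, which is why it matters for mul_basecase and redc.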
--
Torbjörn