Sandybridge addmul_N challenge
Niels Möller
nisse at lysator.liu.se
Wed Feb 22 22:36:39 CET 2012
Torbjorn Granlund <tg at gmplib.org> writes:
> The sandybridge machine tom behind shell is waiting for you. A nehalem
> machine is biko* (lots of Xen machines on the same hardware) and
> repentium is a Conroe ("core2").
The best I find is
    mov (up, n, 8), %rax
    mul v
    mov %rdx, c1
    add (rp, n, 8), %rax
    adc $0, c1
    add %rax, c0
    adc $0, c1
    mov c0, (rp, n, 8)
Unrolled four times, that's 34 instructions. The best result from the
loop mixer so far has been 3.24 cycles / limb. See
shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms, which is the same as
your code, I guess.
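
For reference, here is a rough C model of what one pass of the eight
instructions above computes per limb. It is only a sketch under some
assumptions: 64-bit limbs, the GCC/Clang unsigned __int128 extension,
and made-up names (limb_t, addmul_1_ref) that are not GMP's actual
interface. In the unrolled assembly c0 and c1 would presumably swap
roles from one limb to the next; the model does that with a plain
assignment at the end of each iteration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs */

/* Hypothetical reference model: rp[0..n-1] += up[0..n-1] * v,
   returning the carry-out limb. */
limb_t addmul_1_ref(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t c0 = 0;                          /* incoming carry limb */
    for (size_t i = 0; i < n; i++) {
        /* mov (up, n, 8), %rax ; mul v */
        unsigned __int128 prod = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)prod;           /* %rax */
        limb_t c1 = (limb_t)(prod >> 64);   /* mov %rdx, c1 */

        lo += rp[i];                        /* add (rp, n, 8), %rax */
        c1 += (lo < rp[i]);                 /* adc $0, c1 */

        c0 += lo;                           /* add %rax, c0 */
        c1 += (c0 < lo);                    /* adc $0, c1 */

        rp[i] = c0;                         /* mov c0, (rp, n, 8) */
        c0 = c1;                            /* c1 is the next limb's carry-in */
    }
    return c0;
}
```

Neither adc can overflow c1, since up[i] * v + rp[i] + c0 is at most
2^128 - 1 and so always fits in two limbs.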
A variant with one more instruction, to move away %rax, is at
shell:~nisse/hack/loopmix/lms/addmul_1.nlms; it seems to run at 3.52
cycles / limb.
A variant with 7 instructions, but with two add reg, mem operations, is
also slow.
Regards,
/nisse
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.