Sandybridge addmul_N challenge

Niels Möller nisse at lysator.liu.se
Wed Feb 22 22:36:39 CET 2012


Torbjorn Granlund <tg at gmplib.org> writes:

> The sandybridge machine tom behind shell is waiting for you.  A nehalem
> machine is biko* (lots of Xen machines on the same hardware) and
> repentium is a Conroe ("core2").

The best I have found is

        mov     (up, n, 8), %rax
        mul     v
        mov     %rdx, c1
        add     (rp, n, 8), %rax
        adc     $0, c1
        add     %rax, c0
        adc     $0, c1
        mov     c0, (rp, n, 8)

Unrolled four times, that's 34 instructions. The best result from the
loop mixer so far has been 3.24 cycles / limb. See
shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms, which is the same as
your code, I guess.
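
For reference, the per-limb step above can be modelled in C roughly as
follows. This is only a sketch of the arithmetic, not of the scheduling:
the name ref_addmul_1 is hypothetical, and it assumes 64-bit limbs and
the unsigned __int128 extension of GCC/Clang. In the unrolled assembly
the roles of c0 and c1 would alternate between limbs; the model does
this with an explicit copy at the end of each iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: {rp, n} += {up, n} * v.
   Returns the carry-out limb.  Assumes 64-bit limbs and a compiler
   providing unsigned __int128 (GCC/Clang extension). */
uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t c0 = 0;                        /* carry into the current limb */

    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;  /* mul v */
        uint64_t lo = (uint64_t)p;          /* %rax */
        uint64_t c1 = (uint64_t)(p >> 64);  /* mov %rdx, c1 */

        lo += rp[i];                        /* add (rp, n, 8), %rax */
        c1 += (lo < rp[i]);                 /* adc $0, c1 */
        c0 += lo;                           /* add %rax, c0 */
        c1 += (c0 < lo);                    /* adc $0, c1 */
        rp[i] = c0;                         /* mov c0, (rp, n, 8) */

        c0 = c1;                            /* c1 is the carry for the next limb */
    }
    return c0;
}
```

Neither adc can overflow c1, since up[i]*v + rp[i] + c0 is at most
2^128 - 1 for 64-bit operands.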

A variant with one more instruction to move away %rax, at
shell:~nisse/hack/loopmix/lms/addmul_1.nlms, seems to run at 3.52
cycles / limb.

A variant with 7 instructions, but with two "add reg, mem" operations,
is also slow.

Regards,
/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

