Sandybridge addmul_N challenge

Niels Möller nisse at
Thu Feb 23 21:00:41 CET 2012

Torbjorn Granlund <tg at> writes:

> In loopmixer or manually?  I wouldn't draw any conclusions without
> mixing the code first...

With the loop mixer.

> Meaning evaluating in +1 instead of -1, I assume.


> Did you compute the recurrency chain?  Annotating the instructions on
> the recurrency chain helps understanding the problem.

I tried. I use this iteration 

        mov     (up, n, 8), %rax
        mov     %rax, u
        mul     v0
        mov     %rax, r0
        mov     %rdx, r1
        mov     u, %rax
        mul     v1
        mov     (rp, n, 8), t 
        add     t, r0  
        add     %rax, r1
        mov     %rdx, r2
        adc     $0, r2
        add     c0, r0
	mov     r0, (rp, n, 8)
	adc     c1, r1
        adc     $0, r2

For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
different registers),

        add     c0, r0		C  0  6
	adc     c1, r1		C  1  7
        adc     $0, r2		C  3  9

        add     r1, c2		C  3  9
	adc     r2, c0		C  4  10
	adc     $0, c1		C  6  12

So the recurrency, for one iteration, seems to be just 3 cycles. But the
loop mixer doesn't find anything faster then 6.36 cycles for one
iteration, or 3.18 per limb product. Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier), but
stubbornly above 3 c/l.

> My experience of Sandybridge is that with load/store coding style, the
> CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.

If we could get there, the above loop should run just below 3 c/l.

> I don't think there are.  These instructions are mostly FP plus narrow
> integer ops.  

Hmm. Last time I looked at that was in a 32-bit context. There's a
32x32->64 instruction which might be useful for a 32-bit build, at least
in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.


Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

More information about the gmp-devel mailing list