Sandybridge addmul_N challenge

Thu Feb 23 09:53:54 CET 2012

Torbjorn Granlund <tg at gmplib.org> writes:

> I think we should focus not on addmul_1 but on mul_basecase,
> sqr_basecase, redc_1, perhaps redc_2.  I.e., please focus on addmul_2
> (or addmul_N, N > 2) or vertical multiplication primitives.

Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.

	mov	(up), %rax
        mov	%rax, ul
        mul	vl		C low product
        mov	8(up), uh
        mov	%rax, r0
        mov	%rdx, r1
        lea	(uh, ul), %rax
        sub	uh, ul
        cmovnc	ul, %rax
        sbb	r3, r3
        mul	vd		C Middle product
        mov	r1, r2		C Add shifted low product
        add	r0, r1
        adc	$0, r2
        add	(rp), r0	C Add rp limbs
        adc	8(rp), r1
        adc	$0, r2		
        mov	%rax, p0
        mov	%rdx, p1
        mov	uh, %rax
        mul	vh		C High product
        xor	vmask, r3
        xor	r3, p0		C Conditionally negate, and add, middle product
        xor	r3, p1
        bt	$0, r3
        adc	p0, r1
        adc	p1, r2
        adc	$0, r3
        add	%rax, r1	C Add shifted high product
        adc	%rdx, r2
        adc	$0, r2
        add	c0, r0		C Add input carry limbs
        adc	c1, r1
        mov	c0, (rp)
        mov	c1, 8(rp)
        adc	%rax, r2	C Add high product
        adc	%rdx, r3

37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).

I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.

* At least the moves to p0, p1 can be eliminated.

* One could also save
  some instructions from adding in c0, c1 earlier, and doing an in-place
  add to (rp) at the end, on the theory that the recurrency is less tight.

* I'm also not sure if the order of the three multiplications is the
  best one.

* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
  (if additions are organized in the right way) is done twice, I suspect
  it's going to be a bit painful since the carry has to be applied at
  two places.

What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.

For a "vertical" mul_basecase, the quadratic work would be an iteration
of

	mov	(up), %rax
        mul	(vp)
        add	%rax, r0
        adc	%rax, r1
        adc	$0, r3

So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carrry
propagation, and possible bad branch-prediction due to loops of varying
lenghts.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.