Niels Möller nisse at lysator.liu.se
Thu Feb 23 09:53:54 CET 2012

Torbjorn Granlund <tg at gmplib.org> writes:

> I think we should focus not on addmul_1 but on mul_basecase,
> (or addmul_N, N > 2) or vertical multiplication primitives.

Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.

mov	(up), %rax
mov	%rax, ul
mul	vl		C low product
mov	8(up), uh
mov	%rax, r0
mov	%rdx, r1
lea	(uh, ul), %rax
sub	uh, ul
cmovnc	ul, %rax
sbb	r3, r3
mul	vd		C Middle product
mov	r1, r2		C Add shifted low product
mov	%rax, p0
mov	%rdx, p1
mov	uh, %rax
mul	vh		C High product
xor	r3, p0		C Conditionally negate, and add, middle product
xor	r3, p1
bt	\$0, r3
mov	c0, (rp)
mov	c1, 8(rp)

37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).

I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.

* At least the moves to p0, p1 can be eliminated.

* One could also save
some instructions from adding in c0, c1 earlier, and doing an in-place
add to (rp) at the end, on the theory that the recurrency is less tight.

* I'm also not sure if the order of the three multiplications is the
best one.

* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
(if additions are organized in the right way) is done twice, I suspect
it's going to be a bit painful since the carry has to be applied at
two places.

What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.

For a "vertical" mul_basecase, the quadratic work would be an iteration
of

mov	(up), %rax
mul	(vp)

So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carrry
propagation, and possible bad branch-prediction due to loops of varying
lenghts.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.