Sandybridge addmul_N challenge
Niels Möller
nisse at lysator.liu.se
Thu Feb 23 09:53:54 CET 2012
Torbjorn Granlund <tg at gmplib.org> writes:
> I think we should focus not on addmul_1 but on mul_basecase,
> sqr_basecase, redc_1, perhaps redc_2. I.e., please focus on addmul_2
> (or addmul_N, N > 2) or vertical multiplication primitives.
Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.
mov (up), %rax
mov %rax, ul
mul vl C low product
mov 8(up), uh
mov %rax, r0
mov %rdx, r1
lea (uh, ul), %rax
sub uh, ul
cmovnc ul, %rax
sbb r3, r3
mul vd C Middle product
mov r1, r2 C Add shifted low product
add r0, r1
adc $0, r2
add (rp), r0 C Add rp limbs
adc 8(rp), r1
adc $0, r2
mov %rax, p0
mov %rdx, p1
mov uh, %rax
mul vh C High product
xor vmask, r3
xor r3, p0 C Conditionally negate, and add, middle product
xor r3, p1
bt $0, r3
adc p0, r1
adc p1, r2
adc $0, r3
add %rax, r1 C Add shifted high product
adc %rdx, r2
adc $0, r2
add c0, r0 C Add input carry limbs
adc c1, r1
mov c0, (rp)
mov c1, 8(rp)
adc %rax, r2 C Add high product
adc %rdx, r3
37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).
I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.
* At least the moves to p0, p1 can be eliminated.
* One could also save
some instructions from adding in c0, c1 earlier, and doing an in-place
add to (rp) at the end, on the theory that the recurrency is less tight.
* I'm also not sure if the order of the three multiplications is the
best one.
* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
(if additions are organized in the right way) is done twice, I suspect
it's going to be a bit painful since the carry has to be applied at
two places.
What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.
For a "vertical" mul_basecase, the quadratic work would be an iteration
of
mov (up), %rax
mul (vp)
add %rax, r0
adc %rax, r1
adc $0, r3
So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carrry
propagation, and possible bad branch-prediction due to loops of varying
lenghts.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list