Sandybridge addmul_N challenge
Torbjorn Granlund
tg at gmplib.org
Thu Feb 23 11:13:30 CET 2012
nisse at lysator.liu.se (Niels Möller) writes:
Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.
mov (up), %rax
mov %rax, ul
mul vl C low product
mov 8(up), uh
mov %rax, r0
mov %rdx, r1
lea (uh, ul), %rax
sub uh, ul
cmovnc ul, %rax
sbb r3, r3
mul vd C Middle product
mov r1, r2 C Add shifted low product
add r0, r1
adc $0, r2
add (rp), r0 C Add rp limbs
adc 8(rp), r1
adc $0, r2
mov %rax, p0
mov %rdx, p1
mov uh, %rax
mul vh C High product
xor vmask, r3
xor r3, p0 C Conditionally negate, and add, middle product
xor r3, p1
bt $0, r3
adc p0, r1
adc p1, r2
adc $0, r3
add %rax, r1 C Add shifted high product
adc %rdx, r2
adc $0, r2
add c0, r0 C Add input carry limbs
adc c1, r1
mov c0, (rp)
mov c1, 8(rp)
adc %rax, r2 C Add high product
adc %rdx, r3
37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).
How did you arrive to 12.25 insns/limb? I have not tried to understand
the code, but doesn't it perform a 2x2 limb multiply with accumulation?
That's 9.25 insn/limb product.
I very much doubt this will win for Sandybridge, unless you can decrease
the insn count with several instructions. Unfortunately it has no
chance on Bull-dozer, since the latter has a 2 issue pipeline; you need
to beat 32 insns per 2x2 accumulation block there.
I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.
Perhaps some of the adc $0 could be eliminated with 2x unrolling?
What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.
If that is possible, it might not be enough... See below.
For a "vertical" mul_basecase, the quadratic work would be an iteration
of
mov (up), %rax
mul (vp)
add %rax, r0
adc %rax, r1
adc $0, r3
So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carrry
propagation, and possible bad branch-prediction due to loops of varying
lenghts.
Exactly. (But the branch misprediction problem would not happen for for
David's mulmid_basecase, I suppose.)
Some ways to deal with the branch misprediction problem:
* Have straight line code for the corners, up to a limit. This gets rid
of the really high relative branch misprediction for these small
areas.
* Handle two (or more) columns in parallel, and separately for the
low-significant right triangle, any middle rectangular part, and the
left triangle. This doubles (or more) the amount of useful work per
branch misprediction.
* I suppose that making full use of out-of-order execution just before a
mispredicted branch would make sense.
I played a bit with mul_2 yesternight. I am not 100% the code is
correct, but I think it is. The loopmixer found a 2.5 c/l version of
it.
I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul". I then analysed the critical path and determined that
it is about 24. The problem is adc feeding other adc feeding other adc
though a register (remember that pure carry deps are fast on
Sandybridge). So I mindlessly introduced 4 new registers, then using
alternating accumulation registers. I needed 8 extra insn (in total,
corresponding to 2 per way) to pairwise sum accumulation registers.
I am sure this was not done optimally, but (assuming my code is sound)
it proves that there is a lot of performance headroom, as expected.
I conjecture that we could create an addmul_N for Sandybridge that runs
at <= 2.5 c/l. I think this will be possible already for N=2. Perhaps
we could arrive to 2.25 for N=4, matching the K8-K10 performance.
--
Torbjörn
More information about the gmp-devel
mailing list