Sandybridge addmul_N challenge
Torbjorn Granlund
tg at gmplib.org
Thu Feb 23 17:44:21 CET 2012
nisse at lysator.liu.se (Niels Möller) writes:
One can decrease it a bit by adding c0, c1 earlier (do you think
recurrency can be a problem if we add c0, c1 to the first product?) and
doing an in-place add to (rp) and 8(rp) at the end.
I could get it down to 30 instructions with a deep carry recurrency, or
34 with a short one. I can get neither variant to run faster than 4 c/l.
In loopmixer or manually? I wouldn't draw any conclusions without
mixing the code first...
I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).
Meaning evaluating in +1 instead of -1, I assume.
It's about the same number of instructions, but the updates from
carries are independent of all products, so there's more more freedom
in where to move them around.
I think this idea may be more useful for other processors, without the
awkward hardwired mul registers.
True.
> I played a bit with mul_2 yesternight. I am not 100% the code is
> correct, but I think it is. The loopmixer found a 2.5 c/l version of
> it.
Nice. I've now wasted quite some time... It seems really difficult.
It is challenging, but I am getting convinced we can really speed things
a lot.
Now I also tried a very basic variant of addmul_2, doing only one u limb
per iteration and multiplying it by the two v limbs. Even if I have a
very nice carry recurrence between iterations, add, adc, adc $0, four
cycles, and a small number of instructions (15 per iteration, 32 for the
twice unrolled loop), which one might think could be executed in 11
cycles or 5.5 / iteration or 2.75 cycles per limb product. But it won't
run faster than 6.5 cycles per iteration, or 3.25 c/l.
So it just seems very difficult to convince the cpu to really execute
the independent operations, outside of the recurrency, in parallel.
Did you compute the recurrency chain? Annotating the instructions on
the recurrency chain helps understanding the problem.
My experience of Sandybridge is that with load/store coding style, the
CPU typically executes 3 insn/cycle unless there is a recurrency
dependency stopping that.
BTW, are any of the SSE3 etc instructions useful here?
I don't think there are. These instructions are mostly FP plus narrow
integer ops.
--
Torbjörn
More information about the gmp-devel
mailing list