Sandybridge addmul_N challenge

Thu Feb 23 16:09:56 CET 2012

Torbjorn Granlund <tg at gmplib.org> writes:

> How did you arrive to 12.25 insns/limb?

I did someting wrong, I guess... 9.25 instructions / limb or 3.1 cycles
if we can issue 3 instructions per cycle.

> I very much doubt this will win for Sandybridge, unless you can decrease
> the insn count with several instructions.

One can decrease it a bit by adding c0, c1 earlier (do you think
recurrency can be a problem if we add c0, c1 to the first product?) and
doing an in-place add to (rp) and 8(rp) at the end.

I could get it down to 30 instructions with a deep carry recurrency, or
34 with a short one. I can get neither variant to run faster than 4 c/l.

I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).
It's about the same number of instructions, but the updates
from carries are independent of all products, so there's more more
freedom in where to move them around.

I think this idea may be more useful for other processors, without the
awkward hardwired mul registers. For documentation, this is what the
iteration should do:

	vd = |v1 - v0|, sign in vs (outside loop)

	ud = |u1 - u0|, sign in us
	s = us ^ vs ^ 1
	<p3, p2> = u1 * v1
	<m1, m0> = ud * vd ^ <-s, -s>
	<p1, p0> = u0 * v0

        +-----+-----+
	|p3 p2|p1 p0|
	+-----+--+--+
	      |c1 c0|
	      +-----+
	      |r1 r0|
	   +--+--+--+
	   |p1 p0|
	   +-----+
	   |p3 p2|
	+--+--+--+
	|-s| 0| s|
	+--+--+--+
	   |m1 m0|
      --+--+-----+--+
        |c3 c2 r1 r0|
        +-----------+

or

	<vc, vs> = v1 + v0 (outside loop)

	<uc, us> = u1 + u0
	<p3, p2> = u1 * v1
	<m1, m0> = us * vs
	<p1, p0> = u0 * v0

        +-----+-----+
	|p3 p2|p1 p0|
	+-----+--+--+
	      |c1 c0|
	      +-----+
	      |r1 r0|
	   +--+--+--+
	-  |p1 p0|
	   +-----+
	-  |p3 p2|
	   +-----+
	   |m1 m0|
	+--+--+--+
 	|vc vs|    if uc
	+--+--+
	   |us|    if vc
    --+--+--+-----+
      |c3 c2 r1 r0|
      +-----------+

> Perhaps some of the adc $0 could be eliminated with 2x unrolling?

In effect, that would be a kind of 4x2 multiply. Which would then be
done as two 2x2 (I think the high limbs one get from evaluation rules
out using toom32 or toom42).

Haven't tried that. I suspect one will run out of registers.

> I played a bit with mul_2 yesternight.  I am not 100% the code is
> correct, but I think it is.  The loopmixer found a 2.5 c/l version of
> it.

Nice. I've now wasted quite some time... It seems really
difficult.

Now I also tried a very basic variant of addmul_2, doing only one u limb
per iteration and multiplying it by the two v limbs. Even if I have a
very nice carry recurrence between iterations, add, adc, adc $0, four
cycles, and a small number of instructions (15 per iteration, 32 for the
twice unrolled loop), which one might think could be executed in 11
cycles or 5.5 / iteration or 2.75 cycles per limb product. But it won't
run faster than 6.5 cycles per iteration, or 3.25 c/l.

So it just seems very difficult to convince the cpu to really execute
the independent operations, outside of the recurrency, in parallel.

BTW, are any of the SSE3 etc instructions useful here?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.