Sandybridge addmul_N challenge

Thu Feb 23 17:44:21 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  One can decrease it a bit by adding c0, c1 earlier (do you think
  recurrency can be a problem if we add c0, c1 to the first product?) and
  doing an in-place add to (rp) and 8(rp) at the end.

  I could get it down to 30 instructions with a deep carry recurrency, or
  34 with a short one. I can get neither variant to run faster than 4 c/l.

In loopmixer or manually?  I wouldn't draw any conclusions without
mixing the code first...

  I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).

Meaning evaluating in +1 instead of -1, I assume.

  It's about the same number of instructions, but the updates from
  carries are independent of all products, so there's more more freedom
  in where to move them around.

  I think this idea may be more useful for other processors, without the
  awkward hardwired mul registers.

True.

  > I played a bit with mul_2 yesternight.  I am not 100% the code is
  > correct, but I think it is.  The loopmixer found a 2.5 c/l version of
  > it.

  Nice. I've now wasted quite some time... It seems really difficult.

It is challenging, but I am getting convinced we can really speed things
a lot.

  Now I also tried a very basic variant of addmul_2, doing only one u limb
  per iteration and multiplying it by the two v limbs. Even if I have a
  very nice carry recurrence between iterations, add, adc, adc $0, four
  cycles, and a small number of instructions (15 per iteration, 32 for the
  twice unrolled loop), which one might think could be executed in 11
  cycles or 5.5 / iteration or 2.75 cycles per limb product. But it won't
  run faster than 6.5 cycles per iteration, or 3.25 c/l.

  So it just seems very difficult to convince the cpu to really execute
  the independent operations, outside of the recurrency, in parallel.

Did you compute the recurrency chain?  Annotating the instructions on
the recurrency chain helps understanding the problem.

My experience of Sandybridge is that with load/store coding style, the
CPU typically executes 3 insn/cycle unless there is a recurrency
dependency stopping that.

  BTW, are any of the SSE3 etc instructions useful here?

I don't think there are.  These instructions are mostly FP plus narrow
integer ops.  

-- 
Torbjörn