Sandybridge addmul_N challenge

Thu Feb 23 11:13:30 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we
  have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
  before the loop. Carry input in c0, c1, carry out in r2, r3.

          mov	(up), %rax
          mov	%rax, ul
          mul	vl		C low product
          mov	8(up), uh
          mov	%rax, r0
          mov	%rdx, r1
          lea	(uh, ul), %rax
          sub	uh, ul
          cmovnc	ul, %rax
          sbb	r3, r3
          mul	vd		C Middle product
          mov	r1, r2		C Add shifted low product
          add	r0, r1
          adc	$0, r2
          add	(rp), r0	C Add rp limbs
          adc	8(rp), r1
          adc	$0, r2		
          mov	%rax, p0
          mov	%rdx, p1
          mov	uh, %rax
          mul	vh		C High product
          xor	vmask, r3
          xor	r3, p0		C Conditionally negate, and add, middle product
          xor	r3, p1
          bt	$0, r3
          adc	p0, r1
          adc	p1, r2
          adc	$0, r3
          add	%rax, r1	C Add shifted high product
          adc	%rdx, r2
          adc	$0, r2
          add	c0, r0		C Add input carry limbs
          adc	c1, r1
          mov	c0, (rp)
          mov	c1, 8(rp)
          adc	%rax, r2	C Add high product
          adc	%rdx, r3

  37 instructions, or 12.25 instructions per limb, excluding looping logic
  (and it has to be unrolled twice, to use separate registers for input
  and output carries).

How did you arrive to 12.25 insns/limb?  I have not tried to understand
the code, but doesn't it perform a 2x2 limb multiply with accumulation?
That's 9.25 insn/limb product.

I very much doubt this will win for Sandybridge, unless you can decrease
the insn count with several instructions.  Unfortunately it has no
chance on Bull-dozer, since the latter has a 2 issue pipeline; you need
to beat 32 insns per 2x2 accumulation block there.

  I think the instruction count can be reduced a bit, at the cost of
  higher pressure on the out-of-order execution.

Perhaps some of the adc $0 could be eliminated with 2x unrolling?

  What do you think? If one can get one iteration to run at 12 cycles,
  that's 3 c/l and an improvement over addmul_1. If one can get it down to
  11 or 11.5, one beats 3 c/l.

If that is possible, it might not be enough...  See below.

  For a "vertical" mul_basecase, the quadratic work would be an iteration
  of

  	mov	(up), %rax
          mul	(vp)
          add	%rax, r0
          adc	%rax, r1
          adc	$0, r3

  So there's potential for that to run at 2 cycles per limb product. But
  then there's also a significant linear cost for accumulation and carrry
  propagation, and possible bad branch-prediction due to loops of varying
  lenghts.

Exactly.  (But the branch misprediction problem would not happen for for
David's mulmid_basecase, I suppose.)

Some ways to deal with the branch misprediction problem:

* Have straight line code for the corners, up to a limit.  This gets rid
  of the really high relative branch misprediction for these small
  areas.

* Handle two (or more) columns in parallel, and separately for the
  low-significant right triangle, any middle rectangular part, and the
  left triangle.  This doubles (or more) the amount of useful work per
  branch misprediction.

* I suppose that making full use of out-of-order execution just before a
  mispredicted branch would make sense.

I played a bit with mul_2 yesternight.  I am not 100% the code is
correct, but I think it is.  The loopmixer found a 2.5 c/l version of
it.

I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul".  I then analysed the critical path and determined that
it is about 24.  The problem is adc feeding other adc feeding other adc
though a register (remember that pure carry deps are fast on
Sandybridge).  So I mindlessly introduced 4 new registers, then using
alternating accumulation registers.  I needed 8 extra insn (in total,
corresponding to 2 per way) to pairwise sum accumulation registers.

I am sure this was not done optimally, but (assuming my code is sound)
it proves that there is a lot of performance headroom, as expected.

I conjecture that we could create an addmul_N for Sandybridge that runs
at <= 2.5 c/l.  I think this will be possible already for N=2.  Perhaps
we could arrive to 2.25 for N=4, matching the K8-K10 performance.

-- 
Torbjörn