Sandybridge addmul_N challenge

Wed Feb 22 11:18:52 CET 2012

Torbjorn Granlund <tg at gmplib.org> writes:

> But our multiplication loops for Sandybridge are much worse.  The
> current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l.  I
> have new code running at 3.4 c/l and 3.3 c/l respectively.

Let me think aloud for a bit.

I'm still not sure what the addmul_1 bottleneck is. I note that in the
general x86_64 and core2 code, for each mul, we do two *dependent* adds
before storing the low limb. With the cin and cout for the carrry limbs,
the operations per limb are basically

  mov   (up), %rax
  mul	v
  add	%rax, cin
  xor	cout, cout
  adc	$0, cout
  add	cin, (rp)
  adc	%rdx, cout

The real code is a lot more clever with unrolling and scheduling, but if
I read it correctly, these are the operations done.

One could reorder this as follows,

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	(rp), cin
  adc	$0, cout
  add	%rax, cin
  adc	%rdx, cout
  mov	cin, (rp)

Then we get a bit more independence of operations, since the first add
is independent of the result of the mul. It costs an instruction with
the extra mov to rp. But I suspect that's still better than updating
memory twice, like

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	cin, (rp)
  adc	$0, cout
  add	%rax, (rp)
  adc	%rdx, cout

The recurrency depth seems to be the same in all cases, though, with a
latency of add + adc + adc from cin to cout. If that's what's killing
performance, maybe this would be better,

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	(rp), %rax
  adc	%rdx, cout
  add	%rax, cin
  adc	$0, cout
  mov	cin, (rp)

with a recurrency latency of only add + adc (where the adc in question
has a $0 source operand). If I understand you correctly, that would be
only two cycles on sandybridge. It seems the current sbr code does
something similar? Then we have to rely on scheduling or out-of-order
execution to not get a useless wait between the mul and the first add.

Is instruction issue still limited to three instructions per cycle? Then
that's a more narrow bottleneck than both latency and multiplication
throughput. The above is eight instructions per limb, excluding looping.
So one could at least *hope* to get down to 3 cycles with a moderate
level of unrolling.

Maybe one could try a variant updating (rp) twice, to save an
instruction:

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	%rax, (rp)
  adc	%rdx, cout
  add	cin, (rp)
  adc	$0, cout

Is it possible to squeeze those three memory instructions in less than 3
cycles?

If we try to get the instruction count below 9 (like above), then
there's not much room for moving around %rax and %rdx. But the critical
recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
leaving a bit freedom to move it between iterations if we unroll and use
multiple registers for cin and cout. Hmm, let me give one more variant,
which moves %rax and %rdx out of the way, at the cost of one more
instruction (8 instructions per limb):

  mov   (up), %rax
  mul	v
  mov	%rax, pl
  mov	%rdx, cout
  add	pl, (rp)
  adc	$0, cout
  add	cin, (rp)
  adc	$0, cout

I guess one ought to bring out the loop mixer to find out if any of this
really can run at 3 cycles.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.