mul_fft

Wed Jun 30 21:21:17 UTC 2021

nisse at lysator.liu.se (Niels Möller) writes:

  I think add_n_sub_n was originally motivated by improved locality (could
  apply at different levels of memory hierarcy).

  But maybe we could get close to twice the speed using newer instructions
  with multiple carry flags (I guess that's what
  powerpc64/mode64/p9/add_n_sub_n.asm is doing)? We could probably do
  something similar on x86_64 with adox and adcx (if corresponding
  subtract instructions are missing, on-the-fly negation should be fairly
  efficient).

Yes, p9/add_n_sub_n.asm takes advantage of new addex instructions which
use alternative carry bits.

The POWER processors have since P4 a unique pipeline where all
instructions have 2t cycles latency for some integer t.  Plain integer
ops have a latency of 2. (It is really as if the CPUs run at half the
impressive advertised clock frequency with twice the assumed dispatch
width.)

Because of this, add_n and sub_n cannot be made to run at less than 2
c/l.  And add_n_sub_n is close at 2.25 c/l on P9.

Unfortunately, x86 is not as easy.  The subtract-with-borrow
instructions clobber all flags, making the new adcx/adox pretty
pointless.  And no, there are no non-clobbering negation (or one's
complement) instructions.  One needs to save and restore carry.  There
will be many register-to-register copy instructions, because of x86's
habit of overwriting source operands.  (Such copies are essentially free
on AMD processors, as they only manipulate the register rename map.)

Another example where there is just one carry bit is arm64.  For Apple
M1, as you can see at https://gmplib.org/devel/asm, we have an
add_n_sub_n which runs at just 1.3 c/l.  (Almost all GMP loops,
including add_n and sub_n run at exactly 1 c/l there, amazingly.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622