Small operands gcd improvements

Tue Aug 6 19:08:23 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  Right, mpn_gcd usually ends with a call to mpn_gcd_1 or gcd_2, and the
  latter also usually ends with a call to gcd_1. But I think it's easiest
  to leave that as is until we have good gcd_22.

OK, but let's not forget about it!

  > I expect asm gcd_1 to disappear as the C code should be equivalent.  Do
  > you agree?

  That would be nice. There will be one more function call, but hopefully
  that's not going to be a significant performance regression.

Perhaps it would be worthwhile to do a tailcall for when the leftshift
is 0?  I think most gcd calls, irrespective of operand size, will not
have any factor of two, let alone a common one!

  > I suppose some hardwired stuff for the case u >> v (not bitshift, the
  > mathematical meaning of >>!) might want to be parameterised and also
  > ideally tune/tuneup'ed.

  Should that be done by gcd_1 only? Or do we need some variant of gcd_11
  with an initial division?

I suppose we should have a lowest level without that (conditional or
unconditional) and that that level is gcd_11.

  My workstation (intel broadwell) uses takes 96 cycles for a division (if
  I read https://gmplib.org/~tege/x86-timing.pdf, or is there are faster
  64/64 div isntruction?). And gcd_11 runs at roughly 4 cycles per input
  bit according to speed. So then threshold should be around 24.

Modern Intel processors have different code for 128b/64b and 64b/64b
division.  The former is very slow while the latter apparently uses a
high-radix SRT.

AMD Ryzen has fast 128b/64b (SRT-4 I think)..

  Below patch to add a gcd_11 entrypoint for this arch. Passes make check,
  but would be good to also test with devel/try.

Ah, so you suggest that we keep asm gcd_1 after all.  I was in the
editing process of removing them!  :-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622