mpn_perfpow: do we need a special GCD for mp_bitcnt_t?

Thu Sep 4 15:31:05 UTC 2014

nisse at lysator.liu.se (Niels Möller) writes:

  I've had a look at x86_64/gcd_1.asm, to try to extract an gcd_11.asm
  from it.

  Before entering the main binary gcd loop, the code needs to:

  1. Remove and keep track of common trailing zeros,

  2. Remove trailing zeros on one of the arguments (that might be cheaper
     than checking which argument is odd already and swap them, not sure).

  But the gcd_1.asm code also has a call to mpn_modexact_1_odd in the case
  that u0 >> v0. I'm not sure this makes sense. If something like this is
  kept at all, I'd suggest just getting an 8-bit v0-inverse from
  binvert_limb_table, and rather than computing a larger inverse, just
  loop reducing 8 bits at a time as long as u0 > v0. But one get some
  overhead. Before the main loop one needs the following steps:

  1. Remove and keep track of common trailing zeros,

  2. Remove trailing zeros on *both* arguments (one will be odd, but we
     don't yet know which).

  3. Swap u0, v0 if u0 < v0

  4. Check if u0 is much larger than v0, if not, enter main loop directly.

  Do you think that's worth the effort?

This is trickier than I had realised, and needs some thinking to avoid
performance regressions.

It might make sense to have an entry point for log2(a) \approx log2(b)
allowing even operands, suitable for fresh-out-of multilimb mod
reduction.  It might perhaps make sense to have an entry point for
operands which are also both odd.  And we need an entry point allowing
any two limbs (perhaps requiring non-zero arguments).

Do you agree?  Did I forget some scenario?

Many entry points are simple to handle in the asm code.  In C, we just
need multiple separate functions.

Torbjörn
Please encrypt, key id 0xC8601622