Small operands gcd improvements

Thu Aug 8 10:02:29 UTC 2019

Some random thoughts on our gcd code:

I forgot to create all needed "grabber" files for gcd_11, so we got
slowdown for many x86_64 CPUs.

One thing which blurs measurements is that the gcd_1 files use a
plethora of strategies for initial reduction given 1 x 1.  Some check
their relative sizes and at a hardwired threshold reduce the larger one
with some sort of modulo operation.  Here a plain hardware instruction
is sometimes used, but also a call to mpn_modexact_1c_odd!

I some cases, I fail to understand the gcd_1/gcd_11 speed differences.
On Cortex-A15 gcd_1 measures at around 3.4 c/b while gcd_11 gives around
4.  And that's with the exact same code in the loop aligned to the same
boundary, etc, etc.  (Note that what's checked in now use different
code, my experiments was with non-comitted code.)

The automated measurements of gcd_1 and gcd_11 use different parameters.
That was a mistake.  Fix made for next round.

  I've been considering

  void 
  mpn_gcd_22 (mp_ptr rp, 
              mp_limb_t u1, mp_limb_t u0, 
              mp_limb_t v1, mp_limb_t v0);

  with the requirement that u0 and v0 are odd. Do you prefer something
  different? I'm thinking that it should not be a public function.

I agree about the critera for oddness.  We might also require v1 != 0,
u1 != 0, but then again that might be unnecessary.

Should we brave to return struct{mp_limb_t g0; mp_limb_t g1;} instead of
having an rp parameter?

  > The last loop will be 11.  We can simply inline a copy here as it is
  > tiny.  (A tail call won't work as the functions will have different
  > return types.)

  Except if we have lots of tuned variants of gcd_11 and fewer gcd_22;
  then gcd_22 ought to call the separately selected gcd_11.

We need to inline the right version....

-- 
Torbjörn
Please encrypt, key id 0xC8601622