Small operands gcd improvements

Niels Möller nisse at lysator.liu.se
Wed Aug 14 21:07:26 UTC 2019


Another trick we've been discussing is to use u+v rather than |u-v|, in
the case that u+v is divisible by 4.

I've not yet tried it out for gcd_22 (and I don't yet have any timing
code), but I've tried it for gcd_11. It's slower (not that surprising),
but with some potential for gcd_22.

We need masking tricks to select between the three values u-v, v-u and
u+v, depending on which is both positive and divisible by 4. And
regardless, we still need u-v (or at least the comparison), for the
computation of min(u,v). The below seems to work. Looking at the
interesting operations, it uses

        d = u - v;
        vgtu = LIMB_HIGHBIT_TO_MASK (d);
        v += (vgtu & d);

to get v <-- min (u,v). This is unchanged, except that t was renamed to
d, for "difference".

To get the other value, always divisible by 4 (or by 2 in the implicit
bit representation), the code uses

        t = u ^ v;
        use_add = -(t & 1);     /* The mask value */
        t = u - (use_add ^ v);  /* I'm clever here  */
        vgtu &= ~use_add;       /* Cancel negation if we use add */
        u = (t ^ vgtu) - vgtu;

The cleverness is that in the add case, we need u + v + 1, due to the
carry from the implicit bits, but we get by with just a complement,
thanks to the relation between complement and negation: -x = ~x + 1,
or equivalently -(x+1) = ~x.

So if we try something similar for gcd_22, we'll get two sub_ddmmss,
but as they're independent, we don't add any carry propagation to the
critical path.

I see some possible micro-optimizations in the below code, but I doubt
it can be made faster than the regular gcd_11.

  mp_limb_t
  mpn_gcd_11 (mp_limb_t u, mp_limb_t v)
  {
    ASSERT (u & v & 1);
  
    /* In this loop, we represent the odd numbers u and v without
       the redundant least significant one bit. This reduction in
       size by one bit ensures that the high bit of d, below, is
       set if and only if v > u. */
  
    u >>= 1;
    v >>= 1;
  
    while (u != v)
      {
        mp_limb_t t, d;
        mp_limb_t vgtu, use_add;
        int c;
        
        t = u ^ v;
        
        /* If the least significant bit of t is 1, then the original
           odd values differ in bit 1, so u + v = ...01 + ...11 = ...00,
           i.e., divisible by 4 */
        use_add = -(t & 1);
  
        d = u - v;
        /*  u - ~v = u + v + 1 */
        t = u - (use_add ^ v);
        vgtu = LIMB_HIGHBIT_TO_MASK (d);
  
        /* v <-- min (u, v) */
        v += (vgtu & d);
              
        /* u <-- |u - v|, but only if we're not doing u + v */
        vgtu &= ~use_add;
        u = (t ^ vgtu) - vgtu;
  
        count_trailing_zeros (c, t);
        /* We have c <= GMP_LIMB_BITS - 2 here, so that

             u >>= (c + 1);

           would be safe. But unlike the addition c + 1, a separate
           shift by 1 is independent of c, and can be executed in
           parallel with count_trailing_zeros. */
        u = (u >> 1) >> c;
      }
    return (u << 1) + 1;
  }

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
