hgcd1/2

Torbjörn Granlund tg at gmplib.org
Sat Sep 7 09:10:45 UTC 2019


nisse at lysator.liu.se (Niels Möller) writes:

  I don't know much about itanic performance.

It's not bad for GMP, but its division *libcall* needs some 150 cycles.

I played with a METHOD = 3 div1 for Itanic as its predicated execution
comes in very handy.  The result is that one can do one quotient bit per
cycle.  (Yes, Itanic is long irrelevant, but its bizarre architecture is
fun to play with.)

  > _METHOD 1 benefits from the special handling of q = 1 before invoking
  > DIV1.  _METHOD 2 & 3 are optimised for small q, so implicitly repeats
  > the special handling of q = 1.

  Would be interesting to try to take that out (except for _METHOD 2).

Both METHOD 2 and 3 handles q = 1 well and efficiently.  METHOD 1 might
benefit from special q = 1 handling.

(Of course, the suppressed multiplication will help some machines,
irrespective of METHOD.)

  The current _METHOD 3 code starts with

    if (UNLIKELY ((d0 >> (GMP_LIMB_BITS - 3)) != 0)
        || UNLIKELY (n0 >= (d0 << 3)))

  The first condition isn't that unlikely, though, and will get less
  unlikely when we change hgcd2 to use div1 in some way also for the
  double limb loop. Since the first few iterations of hgcd2 work with
  numbers close to filling two limbs.

I thought 2-3 of about 40 iterations would suffer from that condition.
Don't you agree?

  Coulds do the shifting in the other direction instead:

  static inline mp_double_limb_t
  div1 (mp_limb_t n0, mp_limb_t d0)
  {
    mp_double_limb_t res;
    if (UNLIKELY ((n0 >> 3) >= d0))
      {
        res.d1 = n0 / d0;
        res.d0 = n0 - res.d1 * d0;
      }
    else
      {
        mp_limb_t q, t, mask;

        t = n0 >> 2;
        mask = -(mp_limb_t) (t >= d0);

        q = 4 & mask;
        n0 -= (d0 << 2) & mask;

        t = n0 >> 1;
        mask = -(mp_limb_t) (t >= d0);
        q += 2 & mask;
        n0 -= (d0 << 1) & mask;

        mask = -(mp_limb_t) (n0 >= d0);
        q += 1 & mask;
        n0 -= d0 & mask;

        res.d0 = n0;
        res.d1 = q;
      }
    return res;
  }

  It has the drawback that it adds dependencies on the n0 path, so not
  sure if it can be faster.

The quotient might be underestimated, cannot it?

-- 
Torbjörn
Please encrypt, key id 0xC8601622


More information about the gmp-devel mailing list