GCD project status?

Sun Sep 22 22:34:34 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  I see two ways. Either just do an inversion table, which will cost an
  additional multiplication when used. Than one could handle quotients up
  to 7 or 8 bits bits with just 256 bytes of table.

  Or shift both ni and di, and get quotient with something like

    q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt))

  Then one could handle quotients up to 5 bits with a table of 1024 bytes
  (indexing by 5 significant bits of each of n and d, excluding the most
  significant one bit).

I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD 5.  The
former tables quotients and neeeds one multiply per div1 call and the
latter tables inverses and needs 2 multiplies per div1 call.

The quotient table method isn't pretty.  It can surely be improved.  It
uses a very large table and gets poor throughput.  It also has only
about 80% branch hit rate, which is not good enough.  Niels suggested a
variant with better progress (I get quotient with at most NBITS/2 bits
in the fast path while Niels expects NBITS) when extracting NBITS
numerator and NBITS quotient bits.  I would be happy to see a variant
with NBITS progress.

The inverse table method is more mature, I believe.  It has much smaller
tables and with the default table size (256 2-byte entries) it gets a
branch hit rate of 97%.  Doubling the table results in 98.5% rate.

Benchmark?  The inverse table method seems to be the fastest div1 thus
far for some current Intel processors.

-- 
Torbjörn
Please encrypt, key id 0xC8601622