GCD project status?
Torbjörn Granlund
tg at gmplib.org
Sun Sep 22 22:34:34 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
I see two ways. Either just do an inversion table, which will cost an
additional multiplication when used. Than one could handle quotients up
to 7 or 8 bits bits with just 256 bytes of table.
Or shift both ni and di, and get quotient with something like
q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt))
Then one could handle quotients up to 5 bits with a table of 1024 bytes
(indexing by 5 significant bits of each of n and d, excluding the most
significant one bit).
I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD 5. The
former tables quotients and neeeds one multiply per div1 call and the
latter tables inverses and needs 2 multiplies per div1 call.
The quotient table method isn't pretty. It can surely be improved. It
uses a very large table and gets poor throughput. It also has only
about 80% branch hit rate, which is not good enough. Niels suggested a
variant with better progress (I get quotient with at most NBITS/2 bits
in the fast path while Niels expects NBITS) when extracting NBITS
numerator and NBITS quotient bits. I would be happy to see a variant
with NBITS progress.
The inverse table method is more mature, I believe. It has much smaller
tables and with the default table size (256 2-byte entries) it gets a
branch hit rate of 97%. Doubling the table results in 98.5% rate.
Benchmark? The inverse table method seems to be the fastest div1 thus
far for some current Intel processors.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list