hgcd1/2

Mon Sep 16 07:02:58 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> Not exactly.  I think running div1 like hgcd2 but without any of hgcd2
> bookkeeping would make some sense.  I.e., feed div1 with the input of
> Euclid's algorithm.  To avoid skew from particular operands, perhaps
> table 10 uniformly distrinuted random numbers, then do all (a,b) pairs
> from that table.
>
> The idea is to make measurements more accurate.

Makes sense. If better accuracy is needed.

> I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method
> 2 and 3, 64-bit x86 method 2).  

Nice.

> Since these will not be static or inlined, I call them mpn_div_11. I
> am not sure how to time them within your framework.

I think adding a case

#if HAVE_NATIVE_MPN_DIV_11 
#define div1 mpn_div_11
#else
... other versions depending on HGCD2_DIV1_METHOD
#endif

in hgcd2.c, and then using ./tune/speed mpn_hgcd2, should work. To not
break tuneup of the C versions, one would need to amend that to

#if HAVE_NATIVE_MPN_DIV_11 && !TUNING_BUILD

and add a #define TUNING_BUILD 1 to the various tune/hgcd2_*.c files.
(Or maybe some other define than TUNING_BUILD, since this is meant to
affect both the tuneup and the speed programs).

> I now believe a table-based thing which gets the quotient right > 90% of
> the time will be the best approach.

What should the table look like? If one looks up only the divisor bits,
(i.e., a reciprocal table) one gets better precision with smaller table
size. If one looks up bits of both inputs, table gets larger. One would
then tabulate something like 1uuuuu / 1.ddddd, and shift the result
depending on the bit size difference.

> And perhaps hgcd2 could be made to cope with inaccurate quotients, and
> that without itself executing unpredictable branches? As long as
> progress is made, perhaps sloppy quotients can be tolerated?

After a' = a - q*b, we currently expect 0 <= a' < b (and exit if a' is
close to zero). Allowing a' < 0 is not an easy change; one needs an
absolute value, and maybe a way to deal with negative matrix elements.

Allowing a' > b means that next step needs another a' - b, not b - a'.
To handle that, we need a conditional swap of four pairs of numbers
(three pairs in the single-limb loop). And it has to be rare enough to
not increase the number of iterations. It's not obvious to me that this
can be faster than doing a quotient adjustment in the current iteration.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.