nisse at lysator.liu.se
Mon Sep 16 07:02:58 UTC 2019
tg at gmplib.org (Torbjörn Granlund) writes:
> Not exactly. I think running div1 like hgcd2 but without any of hgcd2
> bookkeeping would make some sense. I.e., feed div1 with the input of
> Euclid's algorithm. To avoid skew from particular operands, perhaps
> table 10 uniformly distrinuted random numbers, then do all (a,b) pairs
> from that table.
> The idea is to make measurements more accurate.
Makes sense. If better accuracy is needed.
> I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method
> 2 and 3, 64-bit x86 method 2).
> Since these will not be static or inlined, I call them mpn_div_11. I
> am not sure how to time them within your framework.
I think adding a case
#define div1 mpn_div_11
... other versions depending on HGCD2_DIV1_METHOD
in hgcd2.c, and then using ./tune/speed mpn_hgcd2, should work. To not
break tuneup of the C versions, one would need to amend that to
#if HAVE_NATIVE_MPN_DIV_11 && !TUNING_BUILD
and add a #define TUNING_BUILD 1 to the various tune/hgcd2_*.c files.
(Or maybe some other define than TUNING_BUILD, since this is meant to
affect both the tuneup and the speed programs).
> I now believe a table-based thing which gets the quotient right > 90% of
> the time will be the best approach.
What should the table look like? If one looks up only the divisor bits,
(i.e., a reciprocal table) one gets better precision with smaller table
size. If one looks up bits of both inputs, table gets larger. One would
then tabulate something like 1uuuuu / 1.ddddd, and shift the result
depending on the bit size difference.
> And perhaps hgcd2 could be made to cope with inaccurate quotients, and
> that without itself executing unpredictable branches? As long as
> progress is made, perhaps sloppy quotients can be tolerated?
After a' = a - q*b, we currently expect 0 <= a' < b (and exit if a' is
close to zero). Allowing a' < 0 is not an easy change; one needs an
absolute value, and maybe a way to deal with negative matrix elements.
Allowing a' > b means that next step needs another a' - b, not b - a'.
To handle that, we need a conditional swap of four pairs of numbers
(three pairs in the single-limb loop). And it has to be rare enough to
not increase the number of iterations. It's not obvious to me that this
can be faster than doing a quotient adjustment in the current iteration.
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel