fast inversion

Sun Apr 26 20:53:49 UTC 2015

Ciao Paul!

Sorry for the long delay, I've been off-line for a long while...

Il Mar, 3 Febbraio 2015 7:09 pm, paul zimmermann ha scritto:
> I did update my notes and code about fast inversion (which were used as
> basis for Algorithm 3.5 ApproximateReciprocal in "Modern Computer

Great news!

> The main change is that now the result is uniquely defined

> Comments are welcome.

After a first glance to the code, two lines surprise me:
      mpn_com_n (tp, tp, n);
      mpn_add_1 (tp, tp, n, ONE);
I wondered why you didn't use
      mpn_neg_n (tp, tp, n);
Then I tested (on shell at gmplib) and...

@shell ~/gmp-repo$ tune/speed -s 1-1030 -f 2 -c mpn_neg mpn_com
mpn_add_1_inplace.1
overhead 6.78 cycles, precision 10000 units of 2.86e-10 secs, CPU freq
3500.08 MHz
              mpn_neg       mpn_com mpn_add_1_inplace.1
1               #5.68         12.54          6.80
2                9.40         13.65         #8.19
4               16.25         11.40         #8.22
8               31.56         16.01         #6.84
16              61.86         25.10         #8.16
32             139.01         44.79         #6.80
64             248.18         85.51         #8.20
128            472.77        206.21         #8.38
256            918.75        372.29         #8.21
512           1915.83        731.53         #6.87
1024          3689.67       1472.14         #8.29

...of course you are right. On many architectures we HAVE_NATIVE_mpn_com,
which is faster than the C loop when sizes grow.

On the note side:

The mpn_invert function in GMP currently uses the ApproximateReciprocal
with a final check-and-possibly-correct step triggered by a check of a
single limb. The main difference with respect to your note is that this
step is not performed for all recursion levels, but only once.

We have two functions: mpn_invert calls mpn_invertappr, the latter returns
a boolean; with your code, something like
    return (up[2*h-l-1] + 3 <= CNST_LIMB(2)); /* X might be off by 1 */
And the former perform the final check if the return value is true.

The other difference is that you suggest to use plain multiplication or
the product modulo B^nm+1. On our side we use multiplication modulo B^nm
(to be honest we only truncate linear operations) or wraparound modulo
B^nm-1. I believe the results are the same obtained by the steps you
suggest. And I probably should change the CNST_LIMB(7) used in the last
check with some stricter value.

We should try if tune/speed is able to detect some speed difference
between the two implementations, probably not...

Best regards,
Marco

-- 
http://bodrato.it/