mpn_invert implementing ApproximateReciprocal.

Mon Dec 14 15:26:37 CET 2009

Ciao,

> Marco, I would commit this code if I were you.

I prefer not to commit, when I'm aware of a test failed by the new code.
And "try mpn_invert" did not work with the approximated inversion
replacing the current one.

> Perhaps we should have separate "mpn_invertappr" (following the naming
> scheme of divappr) and mpn_invert.  But I'd suggest we just let this be
> mpn_invert for now.

I strongly prefer the "two separate" functions option, and I implemented it.
The code I'm going to commit provides, an "exact" function mpn_invert, and
a faster-but-approximated one: mpn_invertappr (with its sub-functions _ni
and _bc).

I hope the included comment is clean enough in explaining the difference.

/*
 All the three functions mpn{,_bc,_ni}_invertappr (ip, dp, n, scratch),
 take the strictly normalised value {dp,n} (i.e. most significant bit
 must be set) as an input, and compute {ip,n}: the approximate
 reciprocal of {dp,n}.

 Let e = mpn*_invertappr (ip, dp, n, scratch) be the returned value; the
 following conditions are satisfied by the output:
   0 <= e <= 1;
   {dp,n}*(B^n+{ip,n}) < B^{2n} <= {dp,n}*(B^n+{ip,n}+1+e) .
 I.e. e=0 means that the result {ip,n} equals the one given by mpn_invert.
      e=1 means that the result _may_ be one less than expected.

 The _bc version returns e=1 most of the times.
 The _ni version should return e=0 most of the time; only 1% of
 possible random input should give e=1.

 When the exact approximation is needed, i.e. e=0 in the relation above:
   {dp,n}*(B^n+{ip,n}) < B^{2n} <= {dp,n}*(B^n+{ip,n}+1) ;
 the function mpn_invert (ip, dp, n, scratch) should be used instead.
*/

One more tunable constant was required: INV_APPR_THRESHOLD. It determines
the size above which the "exact" inverse calls the approximated one and
possibly correct the result.
Moreover the INV_MULMOD_BNM1_THRESHOLD >= INV_NEWTON_THRESHOLD, relation
is not enforced any more[*], because the "exact" inverse uses Newton
iteration even above the Newton threshold.

The values suggested by "tuneup" for the constants (on shell.gmplib, a
64-bits Core i7) are:
#define INV_MULMOD_BNM1_THRESHOLD       124
#define INV_NEWTON_THRESHOLD            149
#define INV_APPR_THRESHOLD               11

This means that mulmod_bnm1 is used for operands bigger than 125 limbs,
Newton iteration is used, for the approximate inverse (where the basecase
is faster, thanks to _divappr) only for operands bigger than 150 limbs.
But the "exact" answer is computed using at least one approximate-Newton
step starting from 11 limbs only. This means, in practice, that the new
code is faster than the current one starting from 11 limbs, and keeps on
giving the correct results!

Regards,
Marco

[*] This suggests that the same issue can probably be important also in
the tuning process of the _binvert constants: BINV_NEWTON_THRESHOLD and
BINV_MULMOD_BNM1_THRESHOLD.
As far as I can see BINV_MULMOD_BNM1_THRESHOLD is always set to zero, this
means that mulmod_bnm1 is always used when Newton is used... Great! But
the problem is that the threshold for Newton was measured with the slower
non-wrapping version of Newton.
The measured BINV_NEWTON_THRESHOLD is probably too high because of this.

-- 
http://bodrato.it/papers/