Paul Zimmermann Paul.Zimmermann at loria.fr
Mon Dec 7 14:31:56 CET 2009

```       Niels,

> From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
> Date: Mon, 07 Dec 2009 12:35:20 +0100
>
> bodrato at mail.dm.unipi.it writes:
>
> >> Here A_k is the most significant part of the input, of size appropriate
> >> for the iteration. A is assumed to be "strictly" normalized with high
> >> limb >= B/2, to relax that I guess one would need one more limb in A.
> >
> > One more limb in the result, yes... because now a virtual limb containing
> > one is assumed, but inverses of numbers smaller than one half needs
> > inverses bigger than one-point-<whatever> (the function actually computes
> > <whatever>).
>
> What I was trying to say is that if we use the less strict normalization
> requirement that the high limb of A is non-zero, then I suspect we need
> A_k (the truncated input used in iteration k) to include one more limb
> than in the algorithm described in MCA.

yes without looking into details I would say that in the worst case you lose
63 bits (on a 64-bit machine) thus less than one word.

> > The "adjustment step" I'm using now... is needed only for the last
> > iteration (the intermediate ones are adjusted by the next iteration), but
> > it is _not_ cheap.
>
> From the work on single-limb inverses (invert_limb, see Alg. 2 and 3 in
> http://www.lysator.liu.se/~nisse/archive/draft-division-paper.pdf), I
> suspect that it is very difficult to get an "exact" reciprocal
> floor((B^{2n} - 1) / X) without an expensive multiplication at the end.

I support this claim, since I remember I already had lots of difficulties to
get the bound of Lemma 3.4.2. By the way do you know the paper by Cornea and
Markstein "Integer Divide and Remainder Operations in the Intel IA-64
Architecture", published in the proceedings of RNC'4? It might be relevant