New mpn_divexact code

Thu Aug 3 12:04:45 CEST 2006

>       mpn_mul (tp, dp, dn, qp, in);
> 
> Here, {qp,in} is much smaller than {dp,dn} and your FFT code cleverly
> transforms {qp,in} only once.

Is that already in GMP?

> Further speedups should be possible if {ip,in} where pre-transformed,
> and then used in the mpn_mullow_n calls.  Perhaps pre-transformation
> would be possible also for the comparatively large {dp,dn}?

Yes of course.

>   When the quotient size (qn) is smaller than the divisor size (dn),
>   it suffices to consider dn limbs of the dividend and divisor (either
>   low or high bits, or dn/2 low bits and dn/2 high bits), thus a cutoff
>   point is enough.
>   
> I think you mean qn instead of dn here the last three times...

Right.

>   The tricky thing here is to glue the high and low parts of the quotient,
>   and ensure the middle bits are correct.
> 
> I can imagine.  Are you aware of any free implementation, or perhaps
> pseudo code, that one could start from?

Here is something I wrote for the book I am writing with Richard Brent.
Basically it says that an overlap of 2 bits is enough. Comments are welcome.

   \begin{exercise} \label{exo_exact_div}
   Design an algorithm that performs an exact division of a $4n$-bit integer by a
   $2n$-bit integer, with a quotient of $2n$ bits, using the idea from the last
   paragraph of \textsection{exactdiv}. Prove that your algorithm is correct.
   \end{exercise}
   \answer{
   Let $A$ be the divisor, $B$ the dividend, and $Q$ the quotient.
   Let $\beta=2^n$, and $Q = Q_1 \beta + Q_0$.
   $Q_0$ is given by $Q_0 = A_0/B_0 (\bmod \beta)$, where $A_0 = A (\bmod \beta)$,
   and $B_0 = B (\bmod \beta)$.
   Now let $Q'_1$ be the rounding to $-\infty$ of $A/B$ to $n$ bits
   (using the algorithms of Chapter~\ref{float}).
   Since $Q$ is assumed to have $2n$ bits, we have $2^{2n-1} \leq A/B < 2^{2n}$,
   thus $2^{2n-1} \leq Q'_1 < 2^{2n}$.
   Since $Q'_1$ has (at most)
   $n$ significant bits, necessarily $Q'_1$ can be written $R 2^m$.
   It is easy to prove that $R=Q_1$.

   The above solution is cheating, since determining the correct rounding
   of $A/B$ to $n$ bits may need to consider all the bits of $A$ and $B$.
   Now assume we divide the upper $2n$ bits of $A$ by the $n$ upper bits of
   $B$: $A_3 \beta + A_2 = Q'_1 B_1 + R$.
   By an analysis similar to that of Theorem~\ref{theoBasecaseDivRem},
   we can prove that $Q'_1$ is never too small, and can be at most too large
   by $2$. Thus $Q'_1 - 2 \leq Q_1 \leq Q'_1$. Thus computing two more
   bits of $Q_0$ --- i.e.~$n+2$ bits instead of $n$ --- will be enough to
   determine the value of $Q_1$, since possible candidates will differ by $4$,
   thus only one candidate lies in $[Q'_1 - 2, Q'_1]$.
   }

> How much time does Jebelean's method need in terms of M(n),
> compared to the traditional Newton inversion?

If D(qn) is the time needed to divide exactly the least qn words,
Jebelean's method needs 2*D(qn/2). This should save a factor of 2
for the basecase, and asymptotically nothing (but the memory usage
will decrease).

> My Newton inversion code could surely be improved without involving
> Jebelean's method.  There are comments in the file (see
> http://swox.com/gmp/devel/mpn/generic/invert_2exp.c) from you ("Paul
> Zimmermann spoke"), and comments by me about using short
> multiplication (mpn_sqrhigh_mn, mpn_mulmid_mn).

Yes we have to try this. Maybe it will save nothing.

Paul