udiv_qr_3by2 vs divappr

Mon Aug 27 14:40:58 UTC 2018

       Dear Niels,

> From: nisse at lysator.liu.se (Niels Möller)
> Date: Sun, 05 Aug 2018 09:59:46 +0200
> 
> nisse at lysator.liu.se (Niels Möller) writes:
> 
> > static inline mp_limb_t
> > divappr_2 (mp_limb_t n2, mp_limb_t n1,
> > 	   mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
> > {
> >   mp_limb_t q1, q0, r1, t1, t0, mask;
> >
> >   umul_ppmm (q1, q0, n2, dinv);
> >   add_ssaaaa (q1, q0, q1, q0, n2, n1);
> >
> >   q1++;  /* Can't overflow */
> >   umul_ppmm (t1, t0, q1, d0);  /* t0 isn't used */
> >   r1 = n1 - d1 * q1 - t1;
> 
> To be able to prove correctness, I'm afraid this needs to be
> 
>   r1 = n1 - d1 * q1 - t1 - (t0 > 0);
> 
> >   mask = - (mp_limb_t) (r1 >= q0);
> >   q1 += mask;
> >   r1 += mask & (d1 + 1);
> >   q1 += (r1 >= d1 - 1);
> >
> >   return q1;
> > }
> 
> I'm attaching a draft paper with the analysis. As for performance, I've
> measurered a few percent speedup on shell.
> 
> Regards,
> /Niels
> 
> [2:application/pdf Show Save:schoolbook-divappr.pdf (200kB)]

quite interesting! I have a few comments:

page 1: the division instruction is now much faster than before on modern
     processors

page 2: sigificant -> significant

page 3, line 2 of Section 3, remove "The output"

page 3, line 5 of Section 3: <d_1, d_0 0> -> <d_1, d_0, 0>
        line 6: correesponds -> corresponds

page 3, The algorithm: <d_1, d_0 0> -> <d_1, d_0, 0>
        step 3: u_2 -> u_1, u_1 -> u_0

page 4, line 5: d_1 - d_0 -> d_0 - d_1

page 4, line 7: inputes -> inputs

page 4, line 13: the the -> the

page 4, line 20: u_1 -> u_0

page 4, line 21: The last terms -> term

page 4, line 27: for the upper bound, I agree you can subtract the maximal
value of the u_0 term in the proof of Theorem 3 from [4], but this gives
\beta-1, not \beta.

page 4: "it follows that r_1 + d_1 + 1 > \beta". I do not agree, take for
example u_1 = u_0 = 0, d_1 = 128, d_0 = 1 with \beta = 256. You get r_1 = 127,
and r_1 + d_1 + 1 = 256 = \beta. But r_1 + d_1 + 1 >= \beta is enough for the
rest of the proof.

page 4, line -2: on line 8 -> on line 9

page 5, line 6: \beta r_1 + d_0 -> \beta r'_1 + d_0

For the case R' < 0, you did not prove the lower bound for r'_1 <= d_1 - 2,
and did not prove the upper bound for r'_1 >= d_1 - 1.

Case R' >= 0: if the -\beta term is replaced by -(\beta-1), then all I can get
is r_1 <= \beta - d_1 - 1.

Remark: on the other hand, an exhaustive search with \beta=2^k with
1 <= k <= 7 did not find any error, even without the p_0 > 0 extra term.
I wonder whether we can get rid of that term, or always subtract -1.
In the latter case (assuming the proof with the p_0 > 0 extra term is correct),
the only problem would be when p_0 = 0.

Paul