udiv_qr_3by2 vs divappr

Mon Aug 27 19:46:30 UTC 2018

paul zimmermann <Paul.Zimmermann at inria.fr> writes:

> page 1: the division instruction is now much faster than before on modern
>      processors

According to https://gmplib.org/~tege/x86-timing.pdf, they're still an
order of magnitute slower than multiplication. E.g 86 vs 3 cycles on
Intel skylake. And in addition, multiplication is piplined, division
likely isn't.

Do you have any suggestion for better wording than "vastly slower" ?

> page 2: sigificant -> significant

Fixed.

> page 3, line 2 of Section 3, remove "The output"

Fixed.

> page 3, line 5 of Section 3: <d_1, d_0 0> -> <d_1, d_0, 0>

I think <d_1, d_0> is right.

>         line 6: correesponds -> corresponds

Fixed.

> page 3, The algorithm: <d_1, d_0 0> -> <d_1, d_0, 0>

Should be <d_1, d_0>, as above.

>         step 3: u_2 -> u_1, u_1 -> u_0

Fixed.

> page 4, line 5: d_1 - d_0 -> d_0 - d_1

Fixed.

> page 4, line 7: inputes -> inputs

Fixed.

> page 4, line 13: the the -> the

Fixed.

> page 4, line 20: u_1 -> u_0

Fixed.

> page 4, line 21: The last terms -> term

Fixed.

> page 4, line 27: for the upper bound, I agree you can subtract the maximal
> value of the u_0 term in the proof of Theorem 3 from [4], but this gives
> \beta-1, not \beta.

This is a bit subtle, maybe I should explicitly repeat the proof form
[4] with needed modifications. My argument is that the proof in [4] is
of the form

  \tilde r = ... + u0 < ... + \beta <= c = max (\beta^2 - D, q_0 \beta)

But if u_0 < \beta is the only thing that lets us conclude that \tilde r
< c, with strict inequality, then there's an off-by-one error here.

Now there are multiple bounds involved; to get \tilde r = c we'd also
need something like u_1, u_0 and q_0 maximal and d_0 = 0.

> page 4: "it follows that r_1 + d_1 + 1 > \beta". I do not agree, take for
> example u_1 = u_0 = 0, d_1 = 128, d_0 = 1 with \beta = 256. You get r_1 = 127,
> and r_1 + d_1 + 1 = 256 = \beta. But r_1 + d_1 + 1 >= \beta is enough for the
> rest of the proof.

Good catch, your example seems to be correct.

> page 4, line -2: on line 8 -> on line 9

Fixed.

> page 5, line 6: \beta r_1 + d_0 -> \beta r'_1 + d_0

Fixed.

> For the case R' < 0, you did not prove the lower bound for r'_1 <= d_1 - 2,
> and did not prove the upper bound for r'_1 >= d_1 - 1.

I probably thought they were "trivial", but I have to double check, and
say something about it.

> Case R' >= 0: if the -\beta term is replaced by -(\beta-1), then all I can get
> is r_1 <= \beta - d_1 - 1.

That is problematic, for this case, we really need that r_1 + (d_1 + 1)
doesn't overflow. I think we need to strengthen the proof to get a
-\beta term, and possibly single out any problematic special cases.

> I wonder whether we can get rid of that term, or always subtract -1.
> In the latter case (assuming the proof with the p_0 > 0 extra term is correct),
> the only problem would be when p_0 = 0.

I have looked at that, so far without success. If I remember correctly,
always subtracting -1 can be treated in the analysis as extending the
single-word range for r_0 to 0 <= r_0 <= \beta. But that breaks the
analysis of the  R' < 0 case. We have

  {r_1, r_0}  >=  q_0 \beta

We want to conclude r_1 >= q_0, but that no longer holds if we allow r_0
= \beta. To make it work, we'd need to streighten the lower bound 

  R' >= q_0 \beta - \beta^2

Hmm... but from another look at the proof from [4], seems to give us
strict inequality

  R' > q_0 \beta - \beta^2

with no extra effort. So maybe we can do that.

So we really should redo the proof from [4] in the current setting, and
prove the bounds we need. Which seems to be

  R' < max (\beta^2 - D, q_0 \beta) - \beta
  R' >= - D
  R' > q_0 \beta - \beta^2

While the current arguments don't rule out equality.

Thanks for the careful review! Updated version (with all the things that
were easy to fix) at
https://www.lysator.liu.se/~nisse/misc/schoolbook-divappr.pdf

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.