div_qr_1n_pi1

Niels Möller nisse at lysator.liu.se
Thu Jun 3 15:32:53 UTC 2021

```Torbjörn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   The critical path, via the u1 variable, is
>
>         umul_ppmm (p1, p0, u1, B2);
>         add_mssaaaa (cy, u1, u0, u0, up[j], p1, p0);
>         u1 -= cy & d;
>
> Nice!
>
> The (cy & d) term is multiplied in the next iteration by B2, i.e., we
> have either the invariant d * B2 or 0 as the contribution to the p1,p0
> product.  If we can somehow add that to u0,up[j] during the umul_ppmm
> then we could save several more cycles.

The current (METHOD == 2) code some somethign similar, it conditionally
adds in B2 at the right place in the next iteration. It uses the
name u2 for the carry that lives unti the next iteration,

umul_ppmm (p1, p0, u1, B2);
ADDC_LIMB (cy, u0, u0, u2 & B2);
u0 -= (-cy) & d;
add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);

Then the u0 update and carry propagation becomes independent of the
multiplication. The critical u1 -> u1 path is shortened to just

INSN         CYCLE
mul         0,5

This is exactly it's done in the MOD_1_1P_METHOD == 2 code.

The drawback, in the context of div_qr_1n_pi1, is that the u2 input to
the iteration makes the quotient production much more costly. It would
be nice if we could stay with this recurrency for u, but keep the other
way of carry handling for q. Something like

umul_ppmm (p1, p0, u1, B2);
ADDC_LIMB (cy, u0, u0, u2 & B2);
u0 -= (-cy) & d;
t = u1 - (u2 & d);  /* Outside of the critical u recurrency */
add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);

umul_ppmm (p1, p0, t, dinv);
/* quotient logic... */

But it's going to be a bit subtle to do that and keep the q and u parts
consistent.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
```