div_qr_1n_pi1

Niels Möller nisse at lysator.liu.se
Thu Jun 3 20:30:25 UTC 2021


nisse at lysator.liu.se (Niels Möller) writes:

> The current (METHOD == 2) code some somethign similar, it conditionally
> adds in B2 at the right place in the next iteration. It uses the
> name u2 for the carry that lives unti the next iteration,
>
>       umul_ppmm (p1, p0, u1, B2);
>       ADDC_LIMB (cy, u0, u0, u2 & B2);
>       u0 -= (-cy) & d;
>       add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);

[...]

> The drawback, in the context of div_qr_1n_pi1, is that the u2 input to
> the iteration makes the quotient production much more costly.

And the cost is basically an

   add_ssaaaa (q2, q1, -u2, u2 & dinv, CNST_LIMB(0), u1);

so at least a sequence and, add, adc.

I think I've found a way to get rid of that, at the cost of more
side-channel leakage, by adding it into the multiplication by previous
u1 (which, unfortunately, may give another carry out). The iteration
would become something like
   
   /* u recurrence, via u2, u1, u0 variables. */
   umul_ppmm (p1, p0, u1, B2);
   ADDC_LIMB (cy, u0, u0, u2 & B2);
   u0 -= (-cy) & d;
   q_u1 = u1; /* Save u1, for below quotient processing */
   add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);

   /* q recurrence, via q0 variable. Input is the sum of "old" u1 and
     "new" u2. */
   q_u1 -= u2;
   if (UNLIKELY(q_u1 == 0) && u2) 
     {
       /* The increment of q_u1 overflowed, handle specially. */
       q0 += cy;
       add_ssaaaa (q1, q2, CNST_LIMB(1), dinv, q0 < cy, q0);
       q0 = 0;
     } 
   else
     {
       umul_ppmm (p1, p0, q_u1, dinv);
       ADDC_LIMB (q_2, q1, q_u1, q0);
       add_ssaaaa (q_2, q_1, q_2, q_1, CNST_LIMB(0), p1 + cy);
       q0 = p0;
     }
   ... store q2 and q1 ...
    
The second multiplication now depends on the first, so may put some more
pressure on out-of-order execution. 

> It would be nice if we could stay with this recurrency for u, but keep
> the other way of carry handling for q. 

I've given it a quick try, so far without success.

I might play a bit more with microoptimizations of the current algorithm
before doing something more sophisticated. You're idea of conditonally
adding the invariant d * B2 at the right place is also interesting, I
don't think I understood it correctly at first reading.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


More information about the gmp-devel mailing list