# About mpn_sqr_basecase

Niels Möller nisse at lysator.liu.se
Tue Feb 21 21:52:36 CET 2012

```I wrote:

>>   2. One could do the shifting differently, applying the shift to the limb
>>      argument of addmul_1.

Torbjörn:

>> And then handle carry-out from the last shift a a conditional add_n.

I:

> It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
> n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
> off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
> shift up[n-2], we only need a conditional add of the single limb
> up[n-1].

It turns out there are a more conditional adds than I thought. When we
call addmul_1 with a shifted limb, the next addmul_1 is shorter. So
adding the carry to the next addmul_1 call is *almost* enough, but we
also have to apply it to the limb not included in that next addmul_1
call. It boils down to a conditional add to the next diagonal product.

Implementation below. It's quite nice and simple, a single loop which in
each iteration computes one diagonal term, calls addmul_1c for
off-diagonal terms (that's the bulk of the work, naturally), and then a
bit of extra fiddling to get the shifting correct. If this algorithm can
be faster than the current mpn_addmul_1-based code is less clear. It's
the O(n) work of MPN_SQR_DIAG_ADDLSH1 compared to the additional O(n)
work done for on-the-fly shifting.

One could hope that the additional shift and carry logic can be
scheduled in parallel with the multiplication work in (the previous)
call of addmul_1c. What's the bottleneck for addmul_1, is it multiplier
throughput, carry propagation latency, or decoder bandwidth?

Regards,
/Niels

#define sqr_2(r3, r2, r1, r0, u1, u0) do {				\
mp_limb_t __p1, __p0, __t, __cy, __u1p;				\
umul_ppmm (__t, (r0), (u0), (u0));					\
umul_ppmm (__p1, __p0, (u0) << 1, (u1));				\
__cy = (u0) >> (GMP_LIMB_BITS - 1);					\
add_ssaaaa (__t, (r1), __p1, __p0, 0, __t);				\
__u1p = (u1) + __cy;						\
__cy = __u1p < __cy;						\
umul_ppmm (__p1, __p0, (u1), __u1p);				\
add_ssaaaa ((r3), (r2), __p1, __p0, -__cy, __t);			\
} while (0)

/* Squaring with on-the-fly shifting for the off-diagonal elements.
Let

uc[i] = up[i] >> (GMP_LIMB_BITS - 1)
us[0] = 2 up[0] mod B,
us[i] = 2 up[i] mod B + uc[i-1], for i > 0

In the first iteration, we compute

up[0] * up[0] + B us[0] * <up[1], up[2], ..., up[n-1]>

In iteration i, we add in

B^{2i} (up[i] * (up[i] + uc[i-1]) + B us[i] * <up[i+1], ..., up[n-1]>)

We have an unlikely carry from the addition up[i] + uc[i-1]. The
current handling is a bit tricky. A simpler alternative is to
compute the product up[i]^2, and conditionally add in up[i] to the
result. */

void
sqr_basecase_1 (mp_ptr rp, mp_srcptr up, mp_size_t n)
{
mp_size_t i;
mp_limb_t ul, ulp, uh, p2, p1, p0, c1, c0, t;

if (n == 1)
{
mp_limb_t ul = up[0];
umul_ppmm (rp[1], rp[0], ul, ul);
return;
}
else if (n == 2)
{
mp_limb_t u0, u1;
u0 = up[0];
u1 = up[1];
sqr_2 (rp[3], rp[2], rp[1], rp[0], u1, u0);
return;
}

ul = up[0];
umul_ppmm (p1, rp[0], ul, ul);
rp[n] = mpn_mul_1c (rp+1, up+1, n-1, ul<<1, p1);

for (i = 1; i < n-2; i++)
{
c0 = ul >> (GMP_LIMB_BITS - 1);
ul = up[i];
ulp = ul + c0;
c1 = ulp < c0;
umul_ppmm (p1, p0, ul, ulp);
add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
(ul << 1) + c0, p1);
}
/* Handle i = n-2 */;
c0 = ul >> (GMP_LIMB_BITS - 1);
ul = up[n-2];
ulp = ul + c0;
c1 = ulp < c0;
umul_ppmm (p1, p0, ul, ulp);
add_ssaaaa (p1, rp[2*n-4], p1, p0, -c1, rp[2*n-4]);

uh = up[n-1];
umul_ppmm (p2, p0, (ul << 1) + c0, uh);
ADDC_LIMB (c0, t, p1, rp[2*n-3]);
add_ssaaaa (p2, rp[2*n-3], p2, p0, c0, t);

/* Handle i = n-1 */
c0 = ul >> (GMP_LIMB_BITS - 1);
ulp = uh + c0;
c1 = ulp < c0;
umul_ppmm (p1, p0, uh, ulp);
add_ssaaaa (rp[2*n-1], rp[2*n-2], p1, p0, -c1, p2);
}

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
```

More information about the gmp-devel mailing list