Niels Möller nisse at lysator.liu.se
Tue Feb 21 21:52:36 CET 2012

I wrote:

>>   2. One could do the shifting differently, applying the shift to the limb

Torbjörn:

>> And then handle carry-out from the last shift as a conditional add_n.

I:

> It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
> n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
> off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
> shift up[n-2], we only need a conditional add of the single limb
> up[n-1].

It turns out there are more conditional adds than I thought. When we
call addmul_1 with a shifted limb, the next addmul_1 is shorter. So
adding the carry to the next addmul_1 call is *almost* enough, but we
also have to apply it to the limb not included in that next addmul_1
call. It boils down to a conditional add to the next diagonal product.

Implementation below. It's quite nice and simple: a single loop which in
each iteration computes one diagonal term, calls addmul_1c for the
off-diagonal terms (that's the bulk of the work, naturally), and then
does a bit of extra fiddling to get the shifting correct. Whether this
algorithm can be faster than the current mpn_addmul_1-based code is less
clear; there's extra work done for the on-the-fly shifting.

One could hope that the additional shift and carry logic can be
scheduled in parallel with the multiplication work in (the previous
iteration of) the loop, at least on machines where the critical resource
is multiplication throughput, carry propagation latency, or decoder
bandwidth?

Regards,
/Niels

#define sqr_2(r3, r2, r1, r0, u1, u0) do {			\
    mp_limb_t __p1, __p0, __t, __cy, __u1p;			\
    umul_ppmm (__t, (r0), (u0), (u0));				\
    umul_ppmm (__p1, __p0, (u0) << 1, (u1));			\
    __cy = (u0) >> (GMP_LIMB_BITS - 1);				\
    add_ssaaaa (__t, (r1), __p1, __p0, 0, __t);			\
    __u1p = (u1) + __cy;					\
    __cy = __u1p < __cy;					\
    umul_ppmm (__p1, __p0, (u1), __u1p);			\
    add_ssaaaa ((r3), (r2), __p1, __p0, -__cy, __t);		\
  } while (0)

/* Squaring with on-the-fly shifting for the off-diagonal elements.
   Let

     uc[i] = up[i] >> (GMP_LIMB_BITS - 1)
     us[0] = 2 up[0] mod B,
     us[i] = 2 up[i] mod B + uc[i-1], for i > 0

   In the first iteration, we compute

     up[0] * up[0] + B us[0] * <up[1], up[2], ..., up[n-1]>

   In iteration i, we add in

     B^{2i} (up[i] * (up[i] + uc[i-1]) + B us[i] * <up[i+1], ..., up[n-1]>)

   We have an unlikely carry from the addition up[i] + uc[i-1]. The
   current handling is a bit tricky. A simpler alternative is to
   compute the product up[i]^2, and conditionally add in up[i] to the
   result. */

void
sqr_basecase_1 (mp_ptr rp, mp_srcptr up, mp_size_t n)
{
  mp_size_t i;
  mp_limb_t ul, ulp, uh, p2, p1, p0, c1, c0, t;

  if (n == 1)
    {
      mp_limb_t ul = up[0];
      umul_ppmm (rp[1], rp[0], ul, ul);
      return;
    }
  else if (n == 2)
    {
      mp_limb_t u0, u1;
      u0 = up[0];
      u1 = up[1];
      sqr_2 (rp[3], rp[2], rp[1], rp[0], u1, u0);
      return;
    }

  ul = up[0];
  umul_ppmm (p1, rp[0], ul, ul);
  rp[n] = mpn_mul_1c (rp + 1, up + 1, n - 1, ul << 1, p1);

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

      rp[n+i] = mpn_addmul_1c (rp + 2*i + 1, up + i + 1, n - i - 1,
			       (ul << 1) + c0, p1);
    }
  /* Handle i = n-2 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ul = up[n-2];
  ulp = ul + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, ul, ulp);
  add_ssaaaa (p1, rp[2*n-4], p1, p0, -c1, rp[2*n-4]);

  /* Single off-diagonal product; fold in the diagonal carry p1 and the
     old rp[2*n-3] limb, leaving the limb for position 2n-2 in p2.  */
  uh = up[n-1];
  umul_ppmm (p2, p0, (ul << 1) + c0, uh);
  add_ssaaaa (p2, t, p2, p0, 0, p1);
  add_ssaaaa (p2, rp[2*n-3], p2, t, 0, rp[2*n-3]);

  /* Handle i = n-1 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ulp = uh + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, uh, ulp);
  add_ssaaaa (rp[2*n-1], rp[2*n-2], p1, p0, -c1, p2);
}

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.