About mpn_sqr_basecase

Tue Feb 21 21:52:36 CET 2012

I wrote:

>>   2. One could do the shifting differently, applying the shift to the limb
>>      argument of addmul_1.

Torbjörn:

>> And then handle carry-out from the last shift a a conditional add_n.

I:

> It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
> n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
> off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
> shift up[n-2], we only need a conditional add of the single limb
> up[n-1].

It turns out there are a more conditional adds than I thought. When we
call addmul_1 with a shifted limb, the next addmul_1 is shorter. So
adding the carry to the next addmul_1 call is *almost* enough, but we
also have to apply it to the limb not included in that next addmul_1
call. It boils down to a conditional add to the next diagonal product.

Implementation below. It's quite nice and simple, a single loop which in
each iteration computes one diagonal term, calls addmul_1c for
off-diagonal terms (that's the bulk of the work, naturally), and then a
bit of extra fiddling to get the shifting correct. If this algorithm can
be faster than the current mpn_addmul_1-based code is less clear. It's
the O(n) work of MPN_SQR_DIAG_ADDLSH1 compared to the additional O(n)
work done for on-the-fly shifting.

One could hope that the additional shift and carry logic can be
scheduled in parallel with the multiplication work in (the previous)
call of addmul_1c. What's the bottleneck for addmul_1, is it multiplier
throughput, carry propagation latency, or decoder bandwidth?

Regards,
/Niels

#define sqr_2(r3, r2, r1, r0, u1, u0) do {				\
    mp_limb_t __p1, __p0, __t, __cy, __u1p;				\
    umul_ppmm (__t, (r0), (u0), (u0));					\
    umul_ppmm (__p1, __p0, (u0) << 1, (u1));				\
    __cy = (u0) >> (GMP_LIMB_BITS - 1);					\
    add_ssaaaa (__t, (r1), __p1, __p0, 0, __t);				\
    __u1p = (u1) + __cy;						\
    __cy = __u1p < __cy;						\
    umul_ppmm (__p1, __p0, (u1), __u1p);				\
    add_ssaaaa ((r3), (r2), __p1, __p0, -__cy, __t);			\
  } while (0)

/* Squaring with on-the-fly shifting for the off-diagonal elements.
   Let

     uc[i] = up[i] >> (GMP_LIMB_BITS - 1)
     us[0] = 2 up[0] mod B,
     us[i] = 2 up[i] mod B + uc[i-1], for i > 0

   In the first iteration, we compute

     up[0] * up[0] + B us[0] * <up[1], up[2], ..., up[n-1]>

   In iteration i, we add in

     B^{2i} (up[i] * (up[i] + uc[i-1]) + B us[i] * <up[i+1], ..., up[n-1]>)

   We have an unlikely carry from the addition up[i] + uc[i-1]. The
   current handling is a bit tricky. A simpler alternative is to
   compute the product up[i]^2, and conditionally add in up[i] to the
   result. */

void
sqr_basecase_1 (mp_ptr rp, mp_srcptr up, mp_size_t n)
{
  mp_size_t i;
  mp_limb_t ul, ulp, uh, p2, p1, p0, c1, c0, t;

  if (n == 1)
    {
      mp_limb_t ul = up[0];
      umul_ppmm (rp[1], rp[0], ul, ul);
      return;
    }
  else if (n == 2)
    {
      mp_limb_t u0, u1;
      u0 = up[0];
      u1 = up[1];
      sqr_2 (rp[3], rp[2], rp[1], rp[0], u1, u0);
      return;
    }

  ul = up[0];
  umul_ppmm (p1, rp[0], ul, ul);
  rp[n] = mpn_mul_1c (rp+1, up+1, n-1, ul<<1, p1);

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
			       (ul << 1) + c0, p1);
    }
  /* Handle i = n-2 */;
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ul = up[n-2];
  ulp = ul + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, ul, ulp);
  add_ssaaaa (p1, rp[2*n-4], p1, p0, -c1, rp[2*n-4]);

  uh = up[n-1];
  umul_ppmm (p2, p0, (ul << 1) + c0, uh);
  ADDC_LIMB (c0, t, p1, rp[2*n-3]);
  add_ssaaaa (p2, rp[2*n-3], p2, p0, c0, t);

  /* Handle i = n-1 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ulp = uh + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, uh, ulp);
  add_ssaaaa (rp[2*n-1], rp[2*n-2], p1, p0, -c1, p2);
}

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.