fixed-size mpn_mul_n for small n?

Mon Feb 13 13:47:31 CET 2012

       Niels,

> I think there's some potential for speed up of the linear term, which is
> mostly relevant for small sizes. The addmul_1 calls can run at 3 cycles
> per limb or so. But then the computing the quotient involves dependent
> multiplications with longer latency, so one may be able to compute the
> independent left and right quotient in less time than computing two
> quotients at the same end. Unclear to me if that's going to make any
> difference in real code, in particular since then left-to-right quotient
> will require some kind of adjustment step.

agreed. But to avoid this dependent multiplication, I believe
Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls
with N, you perform n-1 call with k*N such that the low limb of k*N is -1
(thus the quotient selection is trivial) and one call with N. Here is a
reference C implementation (not tested), where {sp, nn} = (k*N+1)/B:

static void
ecm_redc_1_svoboda (mp_ptr rp, mp_ptr tmp, mp_srcptr np, mp_size_t nn,
                    mp_limb_t invm, mp_srcptr sp)
{
  mp_size_t j;
  mp_limb_t t0, cy;

  /* instead of adding {np, nn} * (invm * tmp[0] mod B), we add
     {sp, nn} * tmp[0], where {np, nn} * invm = B * {sp, nn} - 1 */
  for (j = 0; j < nn - 1; j++, tmp++)
    rp[j + 1] = mpn_addmul_1 (tmp + 1, sp, nn, tmp[0]);
  /* for the last step, we reduce with {np, nn} */
  t0 = mpn_addmul_1 (tmp, np, nn, tmp[0] * invm);
  tmp ++;

  rp[0] = tmp[0];
  cy = mpn_add_n (rp + 1, rp + 1, tmp + 1, nn - 1);
  rp[nn-1] += t0;
  cy += rp[nn-1] < t0;
  if (cy != 0)
    mpn_sub_n (rp, rp, np, nn); /* a borrow should always occur here */
}

Of course the same idea could be applied to redc_2.

> What sizes are important?
> 
> I have started to look a little into elliptic curve cryptography, and
> there the sizes are pretty small. E.g., Using the standard curve over a
> 256-bit prime means that numbers are just four limbs on a 64-bit
> machine. So in this case, I'd expect a specialized a completely unrolled
> squaring function for this size could make a real difference.

for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200
to 400 decimal digits.

Paul