fast arithmetic with little extra memory

Thu Oct 15 21:35:52 CEST 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

> Daniel S. Roche has designed a clever variant of Karatsuba's algorithm
> which requires only O(log n) extra memory (in addition to the read-only
> n+n input and to the read/write 2n output).

Fun! It seems new useful arithmetic building blocks are discovered all
the time. The building block here is a function that computes the 2n
limb value h + (f_0 + f_1) g, where h, f_0, f_1 and g are n limb
values, and h is passed in the result area. Fits quite well with mpn
primitives, basecase would be

mp_limb_t
add_mul_add_n (mp_ptr rp, mp_srcptr f0p, mp_srcptr f1p, mp_srcptr g, mp_size_t n)
{
  mp_size_t i;
  mp_limb_t c;

  for (i = 0, c = 0; i < n; i++)
    {
      mp_limb_t f0, f1;

      /* Is there any simpler way to do add with carry in and out? We compute

           [c, f0] = f0p[i] + f1p[i] + c
       */
      f0 = f0p[i] + c;
      c &= (f0 == 0);
      f1 = f1p[i];
      f0 += f1;
      c += (f0 < f1);

      rp[n+i] = mpn_addmul_1 (rp + i, g, n, f0);
    }
  if (c)
    c = mpn_add_n (rp + n, rp + n, g, n);

  return c;
}

Questions:

  1. How to handle carries and other details for the integer case,
     unbalanced inputs, etc. We may want to compute h + (f0 - f1) * g
     instead, which most likely leads to some new complications.

  2. Is memory usage for Karatsuba important for GMP performance? The
     paper quotes Brent and Zimmermann, "The efficiency of an
     implementation of Karatsuba's algorithm depends heavily on memory
     usage".

  3. How many recursive Karatsuba calls do we typically have between
     toom3 and basecase? The new algorithm costs additional additions
     for deep recursion; the first two levels need 7 additions of size
     n/2, just like plain Karatsuba, but some of the calls deeper down
     needs 9 additions.

  4. What about the basecase, can the above primitive be competitive
     with plain mul_basecase? I wouldn't be too surprised if a
     well-written assembler implementation can run at the same speed,
     even though it does additional work. (There will also be a need
     for a simpler basecase function that computes h + f * g, with h
     located in the result area, i.e. an addmul_n, and of course for
     plain multiplication).

  5. Can the tricks be extended to higher toom functions?

Regards,
/Niels