itch/scratch interface for mul_n

Wed Oct 7 13:45:03 CEST 2009

Hi,

I think the next stage in adoption of itch/scratch interfaces for the
mpn functions should be balanced multiplication, mpn_mul_n. That has
to be done, before even trying to fully do itch/scratch interfaces for
unbalanced multiplication and the advanced division algorithms.

I imagine it might be good for performance to inline the first test,
i.e., do something like

#define MPN_MUL_N_ITCH(n) \
  (BELOW_THRESHOLD (n, MUL_KARATSUBA_THRESHOLD) ? 0 : mpn_mul_n_itch(n))

although that's not so nice for binary compatibility... One might
consider some MUL_KARATSUBA_MIN_THRESHOLD which would be part of the
ABI. Or to use the above macro only for internal calls and only export
the real, non-inlined, function mpn_mul_n_itch. (Or maybe always
inline the test for the first threshold, and call either mul_basecase
or mpn_mul_n_itch + mpn_mul_n). There seems to be a lot of ways to do
things.

Anyway, when looking at mul_n.c, I have a couple of comments and
questions:

* itchifying the function should be fairly straight-forward, using the
  same sequence of size checks. I guess one should introduce some
  wrapper function with the same interface as the current mpn_mul_n,
  to be able to update callers more incrementally?

* the code uses its own karatsuba implementation, and doesn't use
  mpn_toom22_mul. Is there any real performance gain for special
  casing balanced case? Or should mpn_kara_mul_n be deleted, replacing
  all uses with calls to mpn_toom22_mul? (And for squaring,
  mpn_kara_sqr_n and mpn_toom2_sqr are two implementatiions of exactly
  the same thing, so I guess there's no question that mpn_kara_sqr_n
  should be deleted?)

  Same goes for toom-3, and there the overhead for the general
  function, comparerd to a balanced function, should be less
  significant than for Karatsuba.

* At the end of mpn_mul_n, there are some conditionals like

      else
    #if WANT_FFT || TUNE_PROGRAM_BUILD
        {
          /* The current FFT code allocates its own space.  That should probably
    	 change.  */
          mpn_mul_fft_full (p, a, n, b, n);
        }
    #else
        {
          /* Toom4 for large operands.  */
          mp_ptr ws;
          TMP_DECL;
          TMP_MARK;
          ws = TMP_BALLOC_LIMBS (mpn_toom44_mul_itch (n, n));
          mpn_toom44_mul (p, a, n, b, n, ws);
          TMP_FREE;
        }
    #endif
    }

  There's no threshold between Toom-4 and FFT, it's one or the other
  decided at compile time. Why is that? Does toom-4 never beat FFT?

* If all the algorithms called by mpn_mul_n (basecase, toom2, toom3,
  toom4, fft), allow unbalanced inputs, does it still make sense to
  keep a separate function for balanced multiplication? I think it
  still might do, because thresholds and allocation will be lot easier
  than the general unbalanced case.

Regards,
/Niels