mpn_mul is embarrassingly slow

Sun Apr 22 19:09:34 UTC 2018

"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:

  Interesting, but writing an assembly mpn_mul means writing manyversions...

The idea is not to "write mpn_mul in assembly" in the sense to write any
of what is currently in generic/mul.c in assembly.

I suppose as assembly mul.c would look like this:

  if (un == 1)  // vn == 1 too
    umul_ppmm
  else if (((up xor vp) or (un xor vn)) == 0)
    tailcall mpn_sqr
  else if (ABOVE_THRESHOLD (vm, MUL_TOOM22_THRESHOLD))
    tailcall mpn_mul_toom22_or_greater_written_in_C
  mpn_mul_basecase code here

On systems where tailcalls are possible (like amd64 ELF systems) this
will be very low-overhead also for sqr.

  On the generic (or no-asm) side, we could at least swap the first branches
  in mpn_mul. Currently we have:

    if (un == vn)
      {
        if (up == vp)
  	mpn_sqr (prodp, up, un);
        else
  	mpn_mul_n (prodp, up, vp, un);
      }
    else if (vn < MUL_TOOM22_THRESHOLD)
      { /* plain schoolbook multiplication */

  If we start with the if (vn < MUL_TOOM22_THRESHOLD) branch, we can
  (almost) directly consider _mul_basecase for small operands. Of course
  this is a slowdown for code that (should we keep on supporting this?)
  calls mpn_mul(r,a,n,a,n) instead of mpn_sqr(r,a,n).

It is surely silly that we don't have any mpn call for when it is known
that the multiplier and multiplicand are distinct.

-- 
Torbjörn
Please encrypt, key id 0xC8601622