mpn_mul is embarrassingly slow

Tue Apr 24 15:52:00 UTC 2018

nisse at lysator.liu.se (Niels Möller) writes:

  I would prefer the opposite change for GMP7, to have all multiplication
  functions return, but *not* store, the high limb of the product. Which
  also should work nicely with tail calls.

I believe this would work nicely for mul_basecase but not for toom or
fft multiply.  The latter two would need extra temp space or code
contortions.
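
For illustration only, here is a minimal sketch of what the
return-but-not-store convention could look like for a schoolbook
basecase.  Everything in it is an assumption made for the example: the
names mul_basecase_ret and mul_ret, the limb_t type and the 32-bit limb
size are hypothetical and are not GMP code.  The low un+vn-1 limbs go
to rp and the top limb comes back as the return value, so a dispatcher
can finish with a plain tail call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* illustrative limb type, not GMP's mp_limb_t */

/* Hypothetical basecase: stores the low un+vn-1 limbs of
   {up,un} * {vp,vn} at rp and returns the most significant limb
   instead of storing it.  Requires un >= 1 and vn >= 1.  */
static limb_t
mul_basecase_ret (limb_t *rp, const limb_t *up, size_t un,
                  const limb_t *vp, size_t vn)
{
  uint64_t t, carry = 0;
  limb_t hi;
  size_t i, j;

  assert (un >= 1 && vn >= 1);

  /* First row: rp = up * vp[0]; the carry-out stays in a register.  */
  for (i = 0; i < un; i++)
    {
      t = (uint64_t) up[i] * vp[0] + carry;
      rp[i] = (limb_t) t;
      carry = t >> 32;
    }
  hi = (limb_t) carry;

  /* Remaining rows: the previous top limb is stored only once a later
     row needs to accumulate into that position.  */
  for (j = 1; j < vn; j++)
    {
      rp[un + j - 1] = hi;
      carry = 0;
      for (i = 0; i < un; i++)
        {
          t = (uint64_t) up[i] * vp[j] + rp[i + j] + carry;
          rp[i + j] = (limb_t) t;
          carry = t >> 32;
        }
      hi = (limb_t) carry;
    }
  return hi;   /* limb un+vn-1 of the product, never written to rp */
}

/* Hypothetical dispatcher: a real one would choose among several
   algorithms; the point here is only that the call is a tail call.  */
static limb_t
mul_ret (limb_t *rp, const limb_t *up, size_t un,
         const limb_t *vp, size_t vn)
{
  return mul_basecase_ret (rp, up, un, vp, vn);
}
```

A caller that wants the full product stores the returned limb at
rp[un + vn - 1] itself.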

  I had to look up what clever things we try to do when vn is small but un
  is huge. It seems we try to improve locality by breaking U up in pieces
  which hopefully fit the cache.

Yes, that code used to improve things very much at some point.  (No CPU
configuration sets MUL_BASECASE_MAX_UN, so the default of 500 applies
everywhere.  That adds some overhead on modern CPUs, where L1D is much
greater than 500 limbs.)
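
A rough sketch of the piecewise idea, under stated assumptions: limb_t,
MAX_VN, mul_basecase and mul_chunked are hypothetical names with 32-bit
limbs, and this is not the actual mpn_mul code.  U is processed in
pieces of at most `chunk` limbs.  Before each new piece is multiplied,
the vn high limbs of the previous partial product are saved, because
the next basecase call overwrites them; they are added back afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t limb_t;   /* illustrative limb type, not GMP's mp_limb_t */
#define MAX_VN 16          /* hypothetical bound on vn for the save area */

/* Schoolbook product: rp[0..un+vn-1] = {up,un} * {vp,vn}.  */
static void
mul_basecase (limb_t *rp, const limb_t *up, size_t un,
              const limb_t *vp, size_t vn)
{
  size_t i, j;
  memset (rp, 0, (un + vn) * sizeof (limb_t));
  for (j = 0; j < vn; j++)
    {
      uint64_t carry = 0;
      for (i = 0; i < un; i++)
        {
          uint64_t t = (uint64_t) up[i] * vp[j] + rp[i + j] + carry;
          rp[i + j] = (limb_t) t;
          carry = t >> 32;
        }
      rp[j + un] = (limb_t) carry;
    }
}

/* Same product, computed by walking U in pieces of at most `chunk`
   limbs so that each piece and its partial product stay cache-resident.
   Requires un >= 1, 1 <= vn <= MAX_VN, chunk >= 1.  */
static void
mul_chunked (limb_t *rp, const limb_t *up, size_t un,
             const limb_t *vp, size_t vn, size_t chunk)
{
  limb_t save[MAX_VN];
  size_t done = 0, i;

  assert (un >= 1 && vn >= 1 && vn <= MAX_VN && chunk >= 1);

  while (done < un)
    {
      size_t n = un - done < chunk ? un - done : chunk;

      /* The top vn limbs of the product so far overlap the next piece.  */
      if (done > 0)
        memcpy (save, rp + done, vn * sizeof (limb_t));

      mul_basecase (rp + done, up + done, n, vp, vn);

      if (done > 0)
        {
          uint64_t carry = 0;
          for (i = 0; i < vn; i++)
            {
              uint64_t t = (uint64_t) rp[done + i] + save[i] + carry;
              rp[done + i] = (limb_t) t;
              carry = t >> 32;
            }
          /* Propagate the carry into the rest of this piece's product.  */
          for (i = vn; carry != 0 && i < n + vn; i++)
            {
              uint64_t t = (uint64_t) rp[done + i] + carry;
              rp[done + i] = (limb_t) t;
              carry = t >> 32;
            }
        }
      done += n;
    }
}
```

With chunk >= un the loop runs once and the function reduces to a
single basecase call, which is the case where the splitting is pure
overhead.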

  >     if (un == vn)
  >       {
  > !       mpn_mul_n (prodp, up, vp, un);
  >       }

  I wonder if calling mpn_mul_n from here still is useful? Given that both
  basecase and toom functions support unbalanced operands, and we don't
  even have any mpn_toom*_mul_n. Main thing that makes mpn_mul_n nicer is
  that thresholds get a lot easier with only a single size.

Perhaps one could omit it without drawbacks.  Only timing will tell.
:-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622