Wraparound multiplicaton vs mullo

Torbjorn Granlund tg at gmplib.org
Sun Oct 18 09:51:47 CEST 2009


nisse at lysator.liu.se (Niels Möller) writes:

  > I suppose you would need n divisible by 4, or would need to
  > add some bit shifts.
  
  In the current code, the x^2-1 algorithm requires that n is even, and
  the x^4-1 algorithm requires that n is divisible by 4. That could of
  course be relaxed at the cost ofextra shifting, assuming that
  the number of bits is divisible by 2 and 4, respectively.
  
Just to make sure we have the same understanding, what you mean is that
size*GMP_NUMB_BITS is divisible by 2 and 4, respectively, correct?

  Benchmarking mul_n, mullo_n, and the x^2-1 and x^4-1 algorithms over
  the same range of 2-500 limbs as before gives the following (still on
  AMD x86_32):
  
    2-10:  mpn_mul_n wins.
    
   12-22:  mpn_mullo_n wins, with at best a 14% saving (all savings are
           compared to mpn_mul_n). This is basecase mullo.
   
   24-50:  x^2-1 wins, with at best a 23% saving.
   
   52-500: x^4-1 wins, with at best a 35% saving.
  
  I'm not sure if I should be surprised that x^4-1 gets into the game so
  early. For the range of 64-500, it saves 30% or more.
  
  With cleverer evaluation and interpolation, speeding up the linear
  work of x^4-1, I'd expect the range where x^2-1 wins will get pretty
  narrow.
  
It is important to understand that mpn_mullow_basecase is substandard.
If we compare the time, counting in # of clock cycles for mul_basecase,
mullow_basecase, and sqr_basecase, for each accumulated cross product,
we see that mullow_basecase is pretty horrible.  The table below is for
Opteron/Athlon64, the situation is slightly similar for athlon32 (where
I think you benchmark).  The problems solved by mullow_basecase and
sqr_basecase have similar structures, and the corresponding functions
should run at similar speed.

Which proper mullow_basecase, mpn_mul_n should not be used at all.
Furthermore, x^2-1 will get pushed upwards.

There is currently no GMP configuration with a mullow_basecase.  For
athlon32, we have a funny situation with older sqr/mul_basecase and
newer addmul_1, resulting in that the C mullow_basecase runs
surprisingly well.  That probably makes your measurements reasonably
accurate, resembling a more ideal situation.  But even there
mullow_basecase is around 20% slower than sqr_basecase.

It might make sense to time your code on some machine without both
mul_basecase and mullow_basecase.  Or even better, write mullow_basecase
for at least x86_64.

             mul_basecase     mullow_basecase      sqr_basecase
 2               8.31              14.10               7.05
 4               4.41               8.87               4.64
 6               3.59               9.12               5.18
 8               3.39               8.24               4.49
10               3.09               6.80               4.05
12               2.90               6.09               3.78
14               2.82               5.81               3.73
16               2.73               5.61               3.64
18               2.70               5.39               3.41
20               2.64               5.22               3.37
22               2.63               5.21               3.33
24               2.59               4.99               3.28
26               2.58               5.09               3.24
28               2.57               4.93               3.20
30               2.55               4.86               3.17
32               2.55               4.77               3.13
34               2.53               4.74               3.10
36               2.51               4.65               3.07
38               2.70               4.57               3.06
40               2.68               4.57               3.06
42               2.67               4.50               3.03
44               2.63               4.40               3.15
46               2.66               4.35               3.15
48               2.62               4.26               3.09
50               2.63               4.23               3.07

-- 
Torbjörn


More information about the gmp-devel mailing list