# Wraparound multiplicaton vs mullo

Torbjorn Granlund tg at gmplib.org
Sun Oct 18 09:51:47 CEST 2009

```nisse at lysator.liu.se (Niels Möller) writes:

> I suppose you would need n divisible by 4, or would need to

In the current code, the x^2-1 algorithm requires that n is even, and
the x^4-1 algorithm requires that n is divisible by 4. That could of
course be relaxed at the cost ofextra shifting, assuming that
the number of bits is divisible by 2 and 4, respectively.

Just to make sure we have the same understanding, what you mean is that
size*GMP_NUMB_BITS is divisible by 2 and 4, respectively, correct?

Benchmarking mul_n, mullo_n, and the x^2-1 and x^4-1 algorithms over
the same range of 2-500 limbs as before gives the following (still on
AMD x86_32):

2-10:  mpn_mul_n wins.

12-22:  mpn_mullo_n wins, with at best a 14% saving (all savings are
compared to mpn_mul_n). This is basecase mullo.

24-50:  x^2-1 wins, with at best a 23% saving.

52-500: x^4-1 wins, with at best a 35% saving.

I'm not sure if I should be surprised that x^4-1 gets into the game so
early. For the range of 64-500, it saves 30% or more.

With cleverer evaluation and interpolation, speeding up the linear
work of x^4-1, I'd expect the range where x^2-1 wins will get pretty
narrow.

It is important to understand that mpn_mullow_basecase is substandard.
If we compare the time, counting in # of clock cycles for mul_basecase,
mullow_basecase, and sqr_basecase, for each accumulated cross product,
we see that mullow_basecase is pretty horrible.  The table below is for
Opteron/Athlon64, the situation is slightly similar for athlon32 (where
I think you benchmark).  The problems solved by mullow_basecase and
sqr_basecase have similar structures, and the corresponding functions
should run at similar speed.

Which proper mullow_basecase, mpn_mul_n should not be used at all.
Furthermore, x^2-1 will get pushed upwards.

There is currently no GMP configuration with a mullow_basecase.  For
athlon32, we have a funny situation with older sqr/mul_basecase and
surprisingly well.  That probably makes your measurements reasonably
accurate, resembling a more ideal situation.  But even there
mullow_basecase is around 20% slower than sqr_basecase.

It might make sense to time your code on some machine without both
mul_basecase and mullow_basecase.  Or even better, write mullow_basecase
for at least x86_64.

mul_basecase     mullow_basecase      sqr_basecase
2               8.31              14.10               7.05
4               4.41               8.87               4.64
6               3.59               9.12               5.18
8               3.39               8.24               4.49
10               3.09               6.80               4.05
12               2.90               6.09               3.78
14               2.82               5.81               3.73
16               2.73               5.61               3.64
18               2.70               5.39               3.41
20               2.64               5.22               3.37
22               2.63               5.21               3.33
24               2.59               4.99               3.28
26               2.58               5.09               3.24
28               2.57               4.93               3.20
30               2.55               4.86               3.17
32               2.55               4.77               3.13
34               2.53               4.74               3.10
36               2.51               4.65               3.07
38               2.70               4.57               3.06
40               2.68               4.57               3.06
42               2.67               4.50               3.03
44               2.63               4.40               3.15
46               2.66               4.35               3.15
48               2.62               4.26               3.09
50               2.63               4.23               3.07

--
Torbjörn
```