# Wraparound multiplicaton vs mullo

Niels Möller nisse at lysator.liu.se
Thu Oct 22 12:25:43 CEST 2009

```bodrato at mail.dm.unipi.it writes:

> I think it would be better to implement a recursive function:
> compute the product Mod x^4-1 by computing it Mod x^2-1 and Mod
> x^2+1, then using CRT.

I've implemented a divide-and-conquer algorithm like that now, and it
looks good.

CRT is based on 1 = B^n/2 (B^n+1) - (B^n/2 + 1)(B^n-1)

When n has a single trailing zero, performance is about the same as
x^2-1 (not surprising), saving around 25% compared to a full
multiplication. I'm still benchmarking over range 2-512 limbs, on an
x86_32.

When n has two trailing zerros, the dc algorithm beats both x^2-1 and
x^4-1, with typical savings of 25%, 32% and 35%.

When n has three trailing zeros, the dc algorithm gives a saving of
39%. Increasing the number of trailing zeros further gives pretty
small additional improvements in this range.

So for numbers of a couple of hundred limbs, multiplication mod 2^n -
1 is 67% faster (40% time saved) compared to full multiplication.

And it beats current mullo from 32 limbs and up. Some data for small
sizes:

n    mul  mullo mul_wrap x^2-1   x^4-1     dc1   dc (cycles, ratio)
2     78  1.333#  1.355   5.199   0.000   1.538   1.538  1
4    143  1.229#  1.334   3.427   9.255   1.433   1.435  2
6    239  1.133#  1.219   2.486   0.000   1.277   1.276  1
8    397  0.981#  1.130   1.842   3.982   1.156   1.156  3
10    585  0.898#  1.109   1.543   0.000   1.131   1.129  1
12    830  0.840#  1.074   1.185   2.239   1.101   1.104  2
14   1094  0.849#  1.064   1.103   0.000   1.074   1.071  1
16   1433  0.744#  1.055   1.050   1.702   0.973   0.973  4
18   1789  0.781#  1.046   0.941   0.000   0.913   0.913  1
20   2147  0.730#  1.052   0.893   1.292   0.937   0.938  2
22   2413  0.814#  1.086   0.897   0.000   0.894   0.894  1
24   2919  0.742#  1.045   0.840   1.105   0.825   0.825  3
26   3304  0.812#  1.045   0.831   0.000   0.816   0.816  1
28   3824  0.783#  1.020   0.820   0.947   0.787   0.790  2
30   4202  0.828   1.035   0.815   0.000   0.801   0.801# 1
32   4778  0.789#  1.029   0.869   0.927   0.836   0.800  5
34   5331  0.805   1.025   0.788   0.000   0.772   0.772# 1
36   5875  0.791   1.022   0.784   0.853   0.736   0.726# 2
38   6373  0.823   1.034   0.802   0.000   0.785   0.785# 1
40   7102  0.788   1.016   0.786   0.802   0.713   0.713# 3
42   7772  0.794   1.020   0.765   0.000   0.757   0.756# 1
44   8030  0.833   1.021   0.795   0.772   0.732   0.724# 2
46   8909  0.835   1.021   0.770   0.000   0.755   0.752# 1
48   9376  0.793   1.017   0.776   0.743   0.693#  0.694  4
50  10220  0.827   1.013   0.750   0.000   0.752   0.744# 1
52  10544  0.835   1.019   0.768   0.732   0.701   0.689# 2
54  11514  0.854   1.016   0.737   0.000   0.731   0.729# 1
56  11819  0.844   1.013   0.762   0.722   0.685   0.683# 3
58  12928  0.849   1.013   0.736   0.000   0.731   0.731# 1
60  13222  0.871   1.012   0.758   0.719   0.675#  0.675  2
62  14598  0.836   1.015   0.727   0.000   0.726#  0.726  1
64  14994  0.837   1.011   0.764   0.704   0.667   0.661# 6

In the x^4-1 column, zero means not applicable. dc1 is divide and
conquer which never recurses to itself (useful for tuning the
threshold, here it's 16). And last column is number of trailing zeros.

Regards,
/Niels

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_2nm1.c
Type: application/octet-stream
Size: 17990 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20091022/883b9519/attachment-0001.obj>
```