An experiment of Negacyclic and Cyclic convolution on Schonhage's multiplication.

Jeunder Yu gis91542 at cis.nctu.edu.tw
Tue Nov 2 14:16:31 CET 2004

The new trick of Paul's code is to split one transform to two shorter
transforms, and reconstructs product by CRT. For instance, if we have
two operands A and B, A is nl limbs and B is ml limbs, two shorter
transforms are 2*N limbs and 3*N limbs respectively,
where 2*N + 3*N >= nl + ml.

I have a new idea, because the time complexity of transform algorithm
is a convex function, if the length of two shorter transforms are as
close as possible, the time cost will be smaller.

The convolution of current algorithm is negacyclic, the result of
negacyclic convolution is A*B mod 2^(N*BITS_PER_MP_LIMB)+1. It is
possible to perform a cyclic convolution in Schonhage's algorithm,
if we skips the angle adjustment step (i.e. don't multiply the i-th
element by 2^(i*Mp)), we will have a cyclic convolution,
i.e. A*B mod 2^(N*BITS_PER_MP_LIMB)-1.

We can perform two shorter transforms, both transform are N limbs,
one is negacyclic convolution and another one is cyclic convolution,
where N + N >= nl + ml, and the product can be reconstructed by CRT,
the final result is
A*B mod [2^(N*BITS_PER_MP_LIMB)+1]*[2^(N*BITS_PER_MP_LIMB)-1]
= A*B mod 2^(2*N*BITS_PER_MP_LIMB)-1.

Here is the result of experiment, two operands are size in 2^n limbs
with random value. The column old is Paul's code (head), column new0
is old + iterated_fft (my trick),
column new1 is new0 + negacyclic_cyclic_convolution.

+-------------------------------------+
| size |   old   |   new0  |   new1   |
+------+---------+---------+----------+
| n=13 |  0.0108 |  0.0103 |  *0.0092 |
| n=14 |  0.0230 |  0.0217 |  *0.0194 |
| n=15 |  0.0492 |  0.0461 |   0.0490 |
| n=16 |  0.1274 |  0.1211 |  *0.1201 |
| n=17 |  0.2653 |  0.2524 |   0.2676 |
| n=18 |  0.6635 |  0.6466 |   0.6569 |
| n=19 |  1.4387 |  1.3693 |   1.4495 |
| n=20 |  2.8044 |  2.6499 |  *2.6175 |
| n=21 |  5.8088 |  5.4526 |  *5.1213 |
| n=22 | 12.1742 | 11.4590 | *10.7597 |
| n=23 | 25.6633 | 24.1082 |  24.7214 |
+-------------------------------------+

It is strange, new1 is faster than new0 on star (*) rows (n=13,14,16,
20,21,22), but in the other rows, new1 is slower than new0. We must
research into the detail of algorithm, we can enable the traces of
code (i.e. #define TRACE(x) x). For n=17, the output of trace is as
below.

=============================== new0 ================================
mpn_mul_fft_full nl=131072 ml=131072 -> pl2=106496 pl3=159744 k=10

mpn_mul_fft pl=159744 nl=131072 ml=131072 k=10
N=5111808 K=1024, M=4992, l=156, maxLK=1024, Np=10240, np=320
159744x159744 limbs -> 1024 times 320x320 limbs (0.97) <- WATCH THIS!!!
temp space 657408
pl=159744 k=10 K=1024 np=320 l=156 Mp=10 rec=0 sqr=0
mpn_mul_n 1024 of 320 limbs

mpn_mul_fft pl=106496 nl=131072 ml=131072 k=10
N=3407872 K=1024, M=3328, l=104, maxLK=1024, Np=7168, np=224
106496x106496 limbs -> 1024 times 224x224 limbs (0.93) <- WATCH THIS!!!
temp space 460800
pl=106496 k=10 K=1024 np=224 l=104 Mp=7 rec=0 sqr=0
mpn_mul_n 1024 of 224 limbs

=============================== new1 ================================
mpn_mul_fft_full nl=131072 ml=131072 -> pl2=131072 pl3=131072 k=10

mpn_mul_fft pl=131072 nl=131072 ml=131072 k=10
N=4194304 K=1024, M=4096, l=128, maxLK=1024, Np=9216, np=288
131072x131072 limbs -> 1024 times 288x288 limbs (0.89) <- WATCH THIS!!!
temp space 591872
pl=131072 k=10 K=1024 np=288 l=128 Mp=9 rec=0 sqr=0
mpn_mul_n 1024 of 288 limbs

mpn_mul_fft pl=131072 nl=131072 ml=131072 k=10
N=4194304 K=1024, M=4096, l=128, maxLK=1024, Np=9216, np=288
131072x131072 limbs -> 1024 times 288x288 limbs (0.89) <- WATCH THIS!!!
temp space 591872
pl=131072 k=10 K=1024 np=288 l=128 Mp=9 rec=0 sqr=0
mpn_mul_n 1024 of 288 limbs

=====================================================================

Two transforms of new0 are 106496 limbs and 159744 limbs, and two
transforms of new1 are both 131072 limbs.

Although 106496 + 159744 > 2 * 131072, but Np must be the multiple of
K=2^10, there is some wasteful bits. The BUR (bit-utility rate) of
new0 are 0.97 and 0.93, and the BUR of new1 are both 0.89. This is
the reason for new1 slower than new0 on some size of opreands.

+------+---------+---------+---------+----------+------+
| size |   old   |   new0      BUR   |   new1      BUR |
+------+---------+---------+---------+----------+------+
| n=13 |  0.0108 |  0.0103 | .97 .96 |  *0.0092 | 0.89 |
| n=14 |  0.0230 |  0.0217 | .97 .93 |  *0.0194 | 0.94 |
| n=15 |  0.0492 |  0.0461 | .97 .93 |   0.0490 | 0.89 |
| n=16 |  0.1274 |  0.1211 | .97 .93 |  *0.1201 | 0.94 |
| n=17 |  0.2653 |  0.2524 | .97 .93 |   0.2676 | 0.89 |
| n=18 |  0.6635 |  0.6466 | .97 .99 |   0.6569 | 0.94 |
| n=19 |  1.4387 |  1.3693 | .97 .92 |   1.4495 | 0.89 |
| n=20 |  2.8044 |  2.6499 | .96 .99 |  *2.6175 | 0.99 |
| n=21 |  5.8088 |  5.4526 | .99 .99 |  *5.1213 | 0.97 |
| n=22 | 12.1742 | 11.4590 | .99 .99 | *10.7597 | 0.97 |
| n=23 | 25.6633 | 24.1082 | .98 .98 |  24.7214 | 0.92 |
+------+---------+---------+---------+----------+------+
(PS. The BUR of old is same as new0.)

If two transforms are size in N0 = N - delta0 limbs and
N1 = N + delta1 limbs respectively, where N = ceil ( (nl + ml) / 2),
and N0 + N1 >= nl + ml, we can adjust delta0 and delta1 values, and
also considers the BUR, this maybe betther, but the CRT step will be
more complex.

--
Jeunder Yu