Several Points

Mon Apr 26 17:06:11 CEST 2004

1) The release candidate gmp-4.1.3-rc1 compiles and runs correctly with 
the unsupported Intel C++ Compiler, icc-8.0. The only problem is that 
when one runs "make check", there is an error in t-constants.c. It 
seems that the error is in the testing of USHRT_MAX. Therefore I 
commented that one test of USHRT_MAX out and the tests all ran very 
smoothly. I believe that the problem is that the Intel C++ compiler 
treats shorts as 32-bit values. Perhaps a simple hack will suffice to 
remedy this minor glitch.

2) I have worked on a version of the Sch\"ohage-Strassen multiplication 
method to incorporate the Discrete Weighted Transform (DWT), from the 
paper by Crandal and Fagin. The Discrete Weighted Transform allows one 
to compute the negacyclic convolution of two polynomials. The lower and 
upper halves of the product can be obtained by the very simple formulas 
(c+n)/2 and (c-n)/2 respectively. The main advantage is that we are 
doing twice the number of transforms on data sets that are halve the 
length of the zero padded version. The, more accurate, asymptotic 
result is O(2 N log N) instead of O(2 N log 2 N) of the zero padded 
version. This implies a savings of two passes over the data. The 
savings in time is as much as 20%-50% in some (large) cases.
On a side note, simple profiling indicates that low level functions, 
mainly the negation function, takes up as much as 30% of the running 
time of the Sch\"onhage-Strassen algorithm.  Perhaps it is better to 
use prefetching and non-temporal writing in the implementation of the 
complement function, instead of the macro that is currently employed. 
The other contention is the reconstruction of the product that amounts 
to evaluating the polynomial at 2^m where m is the bit length of the 
coefficients of the operand polynomials. This implies that I would need 
either a better implementation of the evaluation or faster addition and 
copying operations are required. My implementation works from the 
highest order coefficients down to the lowest order coefficients so 
that the mpn_copy functions can be used along with a partial mpn_add_n 
+ mpn_add_1 instead of a full mpn_add.

3) I have a bottom-up Karatsuba algorithm. Torbjorn Granlund wanted a 
non-recursive version of the Karatsuba and Toom-Cook algorithms in a 
February post. I hope this bottom-up algorithm can be of use to him. I 
am currently working on making the basecase operands fit in a single 
cache line and aligning them. Also, the interpolation stage can benefit 
from major re-implementation efforts. Given these two efforts, perhaps 
this bottom-up algorithm can be faster than the current Karatsuba 
algorithm.

4) I am in need of an efficient interpolation stage of the 4-way 
Toom-Cook algorithm. I have reduced the number of nx1 multiplications 
(exact divisions are basically nx1 multiplications) to 8 for the 4-way 
and 18 for the 5-way algorithms without the even-odd splitting. With 
splitting, the 4-way algorithm requires 5 nx1 multiplications.

5) I have an implementation of the Montgomery NTT. It appears to be a 
lot slower than the Sch\"onhage-Strassen algorithm from subjective 
measures. I believe than using my single-precision Montgomery 
multiplication actually makes the algorithm slower than using a 
multiplication and a division, at least on the Pentium 4. If anyone can 
help me here please do, I would appreciate any help in this area. The 
single-precision Montgomery multiplication requires 3 multiplication 
instructions as well as 2 subtract with borrow instructions. I would 
want to know if two of these multiplications can be reduced to one 
because the computation $a + (- m' a \bmod b) m$ seems redundant, where 
m' is the modular inverse of m modulo b.
The version of my NTT uses Bailey's 6-step algorithm. This seems to be 
faster than a regular implementation of the NTT without Bailey's 
reduction. Fortunately, the matrix transpose only add up to about 15% 
of the computing time of the algorithm, and my matrix transpositions 
are not very efficient.

Code is at http://www.student.gsu.edu/~zliu2/centrinia.tar.bz2.