zliu2 at student.gsu.edu
Mon Apr 26 17:06:11 CEST 2004
1) The release candidate gmp-4.1.3-rc1 compiles and runs correctly with
the unsupported Intel C++ Compiler, icc-8.0. The only problem is that
when one runs "make check", there is an error in t-constants.c. It
seems that the error is in the testing of USHRT_MAX. Therefore I
commented that one test of USHRT_MAX out and the tests all ran very
smoothly. I believe that the problem is that the Intel C++ compiler
treats shorts as 32-bit values. Perhaps a simple hack will suffice to
remedy this minor glitch.
2) I have worked on a version of the Sch\"ohage-Strassen multiplication
method to incorporate the Discrete Weighted Transform (DWT), from the
paper by Crandal and Fagin. The Discrete Weighted Transform allows one
to compute the negacyclic convolution of two polynomials. The lower and
upper halves of the product can be obtained by the very simple formulas
(c+n)/2 and (c-n)/2 respectively. The main advantage is that we are
doing twice the number of transforms on data sets that are halve the
length of the zero padded version. The, more accurate, asymptotic
result is O(2 N log N) instead of O(2 N log 2 N) of the zero padded
version. This implies a savings of two passes over the data. The
savings in time is as much as 20%-50% in some (large) cases.
On a side note, simple profiling indicates that low level functions,
mainly the negation function, takes up as much as 30% of the running
time of the Sch\"onhage-Strassen algorithm. Perhaps it is better to
use prefetching and non-temporal writing in the implementation of the
complement function, instead of the macro that is currently employed.
The other contention is the reconstruction of the product that amounts
to evaluating the polynomial at 2^m where m is the bit length of the
coefficients of the operand polynomials. This implies that I would need
either a better implementation of the evaluation or faster addition and
copying operations are required. My implementation works from the
highest order coefficients down to the lowest order coefficients so
that the mpn_copy functions can be used along with a partial mpn_add_n
+ mpn_add_1 instead of a full mpn_add.
3) I have a bottom-up Karatsuba algorithm. Torbjorn Granlund wanted a
non-recursive version of the Karatsuba and Toom-Cook algorithms in a
February post. I hope this bottom-up algorithm can be of use to him. I
am currently working on making the basecase operands fit in a single
cache line and aligning them. Also, the interpolation stage can benefit
from major re-implementation efforts. Given these two efforts, perhaps
this bottom-up algorithm can be faster than the current Karatsuba
4) I am in need of an efficient interpolation stage of the 4-way
Toom-Cook algorithm. I have reduced the number of nx1 multiplications
(exact divisions are basically nx1 multiplications) to 8 for the 4-way
and 18 for the 5-way algorithms without the even-odd splitting. With
splitting, the 4-way algorithm requires 5 nx1 multiplications.
5) I have an implementation of the Montgomery NTT. It appears to be a
lot slower than the Sch\"onhage-Strassen algorithm from subjective
measures. I believe than using my single-precision Montgomery
multiplication actually makes the algorithm slower than using a
multiplication and a division, at least on the Pentium 4. If anyone can
help me here please do, I would appreciate any help in this area. The
single-precision Montgomery multiplication requires 3 multiplication
instructions as well as 2 subtract with borrow instructions. I would
want to know if two of these multiplications can be reduced to one
because the computation $a + (- m' a \bmod b) m$ seems redundant, where
m' is the modular inverse of m modulo b.
The version of my NTT uses Bailey's 6-step algorithm. This seems to be
faster than a regular implementation of the NTT without Bailey's
reduction. Fortunately, the matrix transpose only add up to about 15%
of the computing time of the algorithm, and my matrix transpositions
are not very efficient.
Code is at http://www.student.gsu.edu/~zliu2/centrinia.tar.bz2.
More information about the gmp-devel