Improved Toom3 square
tege at swox.com
Tue Sep 19 11:12:39 CEST 2006
Jaewook Chung <j4chung at uwaterloo.ca> writes:
I have written an improved routine for mpn_toom3_sqr_n() (source code is
attached to this email). It uses the same algorithm as Zimmerman's
implementation currently used in GMP 4.2.x. Zimmerman does not use separate
variables for storing carries, but in my testing it turned out that it is
better to store carries in separate variables as Harley did in the previous
version of GMP v4.1.x.
In what sense is your routine improved?
Do you think your code is nicer, or is it measurably faster?
What are the numbers?
Pentium 4 is not the best platform for timing measurements,
as it is really odd in many respects.
Please use Athlon (32-bit or 64-bit), a Pentium 3, Pentium M, or
Core/Core 2. (Don't use Core 2 in 64-bit more though, with present
GMP assembly code, for tuning your code.)
I do not know whether "make check" thoroughly checks the correctness of toom3
square routine, but it passed the test. I have also done some correctness
check myself and have found no error.
I think tests/devel/try might be more suitable.
I have also tried to see whether using shift&add is better than using
mpn_addmul(), but in my testing there wasn't enough difference to conclude
that which one is better. In the code, both two methods are given, but
currently the method using mpn_addmul() is enabled by #if directive.
On Pentium 4, at least the 64-bit ones, both mpn_addmul_1 and
mpn_Xshift run equally slowly.
More information about the gmp-devel