Improved Toom3 square

Tue Sep 19 11:12:39 CEST 2006

Jaewook Chung <j4chung at uwaterloo.ca> writes:

  I have written an improved routine for mpn_toom3_sqr_n() (source code is 
  attached to this email). It uses the same algorithm as Zimmerman's 
  implementation currently used in GMP 4.2.x. Zimmerman does not use separate 
  variables for storing carries, but in my testing it turned out that it is 
  better to store carries in separate variables as Harley did in the previous 
  version of GMP v4.1.x.

In what sense is your routine improved?
Do you think your code is nicer, or is it measurably faster?
What are the numbers?

Pentium 4 is not the best platform for timing measurements,
as it is really odd in many respects.

Please use Athlon (32-bit or 64-bit), a Pentium 3, Pentium M, or
Core/Core 2.  (Don't use Core 2 in 64-bit more though, with present
GMP assembly code, for tuning your code.)

  I do not know whether "make check" thoroughly checks the correctness of toom3 
  square routine, but it passed the test. I have also done some correctness 
  check myself and have found no error.

I think tests/devel/try might be more suitable.

  I have also tried to see whether using shift&add is better than using 
  mpn_addmul(), but in my testing there wasn't enough difference to conclude 
  that which one is better. In the code, both two methods are given, but 
  currently the method using mpn_addmul() is enabled by #if directive.

On Pentium 4, at least the 64-bit ones, both mpn_addmul_1 and
mpn_Xshift run equally slowly.

-- 
Torbjörn