Toom-4.5 (aka Toom-5x4, Toom-6x3, Toom-7x2)

Torbjorn Granlund tg at gmplib.org
Thu Oct 15 22:48:01 CEST 2009


David Harvey <dmharvey at cims.nyu.edu> writes:

  On Oct 12, 2009, at 9:04 AM, bodrato at mail.dm.unipi.it wrote:
  
  > Is the division by 3 (a divisor of B-1) twice as fast as the
  > division by 9
  > (not a divisor)? If it is not, then I'll prefer one slow division.
  > If it
  > is, then I'll #define div_by9() {div_by3(); div_by3();} ...
  
  Depends on the chip. For example, looking at the headers for mpn/
  x86_64/bdiv_dbm1c.asm and mpn/x86_64/dive_1.asm (are these the right
  files?), the cycle counts for K8 (opteron) are respectively 2.25
  cycles per limb and 10 cycles per limb. More than a factor of four.
  Note also that bdiv_dbm1 is even slightly faster than mul_1!! (2.5
  c/l)
  
I wrote a x86_64 "mpn_bdiv_dbm1dbm1", i.e., Hensel division by a divisor
of B minus 1 with repeated twice.  It runs at 4.5 c/l on AMD chips,
exactly twice that of mpn_bdiv_dbm1c.  It could be used for division by
9, 45, 75, 153, 225, etc.

I am working on mpn_bdiv_q_1_pi2, i.e. Hensel division by a one-limb
number d using a two-limb inverse d^(-1) mod B^2.  It should run at 6 or
7 c/l on AMD chips.  The code is simple.

Another approach for exact division that I tried last summer is to use
Jebelean's trick with truncating division from the left and Hensel
division from the right.  Since the two sub-algorithms run at different
latencies, one use a "limping" approach where more Hensel steps than
truncating steps are performed.  Theoretically, this could run at 5.65
c/l with some crazy unrolling.  In practice, 7 cycles might be
attainable.

I think mpn_bdiv_q_1_pi2 is the most sane approach for Toom's needs.
The two-limb inverse is no problem, since we're using compile-time
constants.  And unlike mpn_bdiv_dbm1dbm1, it handles any odd divisor.
It could also be made to handle even divisiors.

The name mpn_bdiv_q_1_pi2 might ask for an explanation.

 * bdiv is for Hensel division
 * q means that the quotient is stored
 * 1 is the limbs count of the divisor
 * pi2 is for precomputed inverse of 2 limbs.

-- 
Torbjörn


More information about the gmp-devel mailing list