# Toom-4.5 (aka Toom-5x4, Toom-6x3, Toom-7x2)

Torbjorn Granlund tg at gmplib.org
Thu Oct 15 22:48:01 CEST 2009

```David Harvey <dmharvey at cims.nyu.edu> writes:

On Oct 12, 2009, at 9:04 AM, bodrato at mail.dm.unipi.it wrote:

> Is the division by 3 (a divisor of B-1) twice as fast as the
> division by 9
> (not a divisor)? If it is not, then I'll prefer one slow division.
> If it
> is, then I'll #define div_by9() {div_by3(); div_by3();} ...

Depends on the chip. For example, looking at the headers for mpn/
x86_64/bdiv_dbm1c.asm and mpn/x86_64/dive_1.asm (are these the right
files?), the cycle counts for K8 (opteron) are respectively 2.25
cycles per limb and 10 cycles per limb. More than a factor of four.
Note also that bdiv_dbm1 is even slightly faster than mul_1!! (2.5
c/l)

I wrote a x86_64 "mpn_bdiv_dbm1dbm1", i.e., Hensel division by a divisor
of B minus 1 with repeated twice.  It runs at 4.5 c/l on AMD chips,
exactly twice that of mpn_bdiv_dbm1c.  It could be used for division by
9, 45, 75, 153, 225, etc.

I am working on mpn_bdiv_q_1_pi2, i.e. Hensel division by a one-limb
number d using a two-limb inverse d^(-1) mod B^2.  It should run at 6 or
7 c/l on AMD chips.  The code is simple.

Another approach for exact division that I tried last summer is to use
Jebelean's trick with truncating division from the left and Hensel
division from the right.  Since the two sub-algorithms run at different
latencies, one use a "limping" approach where more Hensel steps than
truncating steps are performed.  Theoretically, this could run at 5.65
c/l with some crazy unrolling.  In practice, 7 cycles might be
attainable.

I think mpn_bdiv_q_1_pi2 is the most sane approach for Toom's needs.
The two-limb inverse is no problem, since we're using compile-time
constants.  And unlike mpn_bdiv_dbm1dbm1, it handles any odd divisor.
It could also be made to handle even divisiors.

The name mpn_bdiv_q_1_pi2 might ask for an explanation.

* bdiv is for Hensel division
* q means that the quotient is stored
* 1 is the limbs count of the divisor
* pi2 is for precomputed inverse of 2 limbs.

--
Torbjörn
```