Toom-4.5 (aka Toom-5x4, Toom-6x3, Toom-7x2)

Fri Oct 9 23:15:43 CEST 2009

David Harvey <dmharvey at cims.nyu.edu> writes:

> (1) can the evaluation points in Marco's code be changed so that
> division by 45 becomes something else that succumbs to these tricks?

Evaluation in 0, oo, ±1, ±2, ±1/2 gives a matrix with determinant

  9331200 = 2^9 * 3^6 * 5^2

So these are the numbers we have to divide out somewhere in the
interpolation. If we want to use the B-1 trick, we can only divide by
3, 5 or 15 at a time, so we'd need at least 6 divisions. Combining
factors, such as 45 = 3^2 * 5, gives fewer but more expensive
divisions.

> (2) can you do fast Hensel division by B^2 - 2*B + 1? Maybe the only
> way is to divide by B - 1 twice, or maybe something faster is possible.

The multiplicative inverse (mod B^2) is 2B + 1, so I'd expect division
to be cheap. But to get a division by 45, you'd also have to multiply
by (B^2 - 2B + 1 ) / 45, which is a two limb number. I'd expect that
to be quite expensive (although with a nicer recurrency than general
divexact which has a long series of dependent multiplications)

> (3) can you do fast Hensel division by 2^64 - 16? (It's not odd, but
> maybe some sense can be made of this?)

Anyway, I think hensel division or divexact by d = 2^60 - 1 should be
fairly efficient, or at least multiplication free. Even id we do it
one full limb at a time. The multiplicative inverse mod 2^64 is 2^60 +
1, so in the loop the lowest quotient word would be computed as

  q_0 = - (u_0 + 2^60 u_0)  (mod B)

and then the update of the partial remainder is

  U -= q_0 u_0

I have some difficulty figuring out exactly what the effect of the mod
B truncation of q_0 will be (we truncate the top 60 bits in 2^60 u0,
then there's a possible carry from the addition, and finally the
negation).

Ignore some of this issues, by putting u0 = 2^4 uh + ul and
considering the quotient word - (u0 + 2^60 ul) (to which one needs to
add 0, B or 2B to get the q_0 above), we get the product

  (u0 + 2^60 ul) (2^60 - 1) = - u0 + 2^64 uh + 2^120 ul 

So I think it should be possible to do the update of the partial
remainder without multiplication, but perhaps with somewhat complex
carry logic. Since the (uh + 2^56 ul) update term is the sum of 60-bit
quantities, there's no risk of overflow there, so maybe one could do
the update corresponding to a quotient word with some extra bit, and
do the needed carry handling on the q-side?

That would mean, compute the negated quotient, word by word, with
carry

  {cout, q_j} = u_j + (2^60 u_j mod 2^64) + cin

and do the update as

  U += B^{j+1} ( floor (2^{-4} u_j) + 2^56 (u_j mod 16))

(here we also have carry propagation to do between iterations).

Essentially the same method should apply to all divisors of the form
2^k - 1, with k >= word size / 2, with

  (u0 + 2^k ul) (2^k - 1) = - u0 + B uh + 2^{2k} u_l

Regards,
/Niels