Iterative Toom-Cook-2.5 (Toom32) method for unbalanced multiplication

Wed Dec 23 10:12:31 CET 2009

Alle 18:48, martedì 22 dicembre 2009, hai scritto:
> Dear Alberto,
>
> That very much for this contribution to the GNU project!

Dear Torbjorn,
                        it's my very first time. Let me tell you that it was 
very complicated for me to "survive" in programming, as I usually need a lot 
of time to catch the "atmosphere" of a project. I have to thank Marco, that 
continuosly encouraged and guided me in writing the code. Without his support 
and advice I couldn't have written anythig.

> Last year, Niels Möller implemented toom22 (aka karatsuba) using
> separate "transforms" of one or both operands.  IIRC, he saw savings of
> around 10% per pre-transformed operand.
>
> His goal was speeding up computations with explicit invariance, such as
> matrix multiplication with bignum matrix elements.
>
> Your goal is a bit different; you want to optimise the implicit
> invariance of unbalanced multiplication.

What I wanted to do is to use the fact that evaluations of the shortest 
operand (let it be b) can be done just once (I think this is what you mean 
for "implicit invariance").

> Explicit and implicit invariance is the next big thing I hope to pursue
> in GMP, so your contribution comes very timely!

Very well, I'm happy for that !

>   2) Divide-by-2 technique, moving shifts from interpolation phase to
>      evaluation phase, when possible. For Toom-2.5 this means that if
>      evaluation of b in 1 (and therefore in -1, too) is even, it is
>      possible to do one single shift in b evaluation phase, and no
>      shifts at all in all interpolation phases during iteration.
>
> You could perhaps always shift, and save any lost bit separately?

If the least meaningful bit is 1, one should in practice do an extra sum, 
sooner or later. This can be done with an extra explicit operation (but in 
this case I think the benefit could be vanished), or with something like an 
addmul (I saw the other mails between you and Niels), but as this is my firt 
approach to mpn level, I didn't dare to do more than I could as "first step".

> Why did you chose toom32 as a primitive for very unbalanced operations.

Essentially because it is the esiest case of unbalanced Toom-Cook, and, being 
my first temptative, I decided to move my first steps in a gradual way. 
Moreover, I discovered that also (plain) toom32 can be bettered.

> Which would be more efficient, tom32 or, say, toom42?

For sure toom42, and in general toomX2. I'll send you my note in which I tryed 
to explain everything more clearly. Sorry I can't put it on the web, because 
it has still not been officially printed by the University service.

> One cost related to very unbalanced multiplication is the assembly of
> the toom-generated pieces.  This could be avoided by making the
> interpolation code understand the situation; it could add instead of
> writ the low n product limbs.  Do you do that?

Once again, I invite you to have a look to the note. If any doubt remains, 
I'll happy to detail.

>   The code contains two version of the functions. The first one (1)
>   considers only parity of b evaluation, the second (2) of b and of a's
>   sections. In my experiments, (1) seems to be a bit faster than (2) on
>   my architecture, but I report both versions. Maybe testing on other
>   architectures could give different results. Compile with
>
> I don't understand "of b and of a's sections".

Same as above.

>   I made some tests comparing the code with gmp-4.3.1, the development
>   version (December 2nd) and with a small modification suggested by
>   Marco. The maximum obtained gains lie in my case between 26 and 33 %.
>   I hope the code can someway be useful. Obviously corrections,
>   comments, changes, optimisations and whatever else are all welcome. Thank
> you very much in advance.
>
> Nice speedups!

If possible, I invite you to double check these times on all the machines 
types you have. May be you can add this method and produce new performance 
graphs.

> I hope you will not be disappointed if this code does not get integrated
> until after the upcoming release?

Of course not ! Take your time, as I imagine that checking new code before 
adding it to the library is a delicate matter, and probably the upcoming 
release is one step away. If one day my code will be considered, that will be 
enough for me.

Alberto