Karatsuba Speedup

Sun Feb 15 08:29:26 CET 2004

Josh Liu <zliu2 at student.gsu.edu> writes:
>
> It is advantageous to not repeat that addition that was mentioned in
> the manual. I have timed both mpn_kara_mul_n and my 5 half operand
> size addition method and it seems that the function call overhead is
> not much of a big deal when compared with the speedup possible by
> eliminating 1/6 of the additions at the tail end of the
> function.

The remark in the manual about this was based on experience.  We
measured code with the duplicate additions avoided, and it wasn't an
improvement.  I forget exactly, but I have an idea it might have been
just slightly slower.

Bear in mind addition can be as little as 2 cycles/limb, so when
getting into karatsuba at as few as 20 or 30 limbs you're looking at
only a saving of 20 or 30 cycles less an extra function call (or is it
2 calls?).  Either way it's not much compared to the real work of the
basecase multiply.

> Also, it might be possible to combine two addition
> operations into a single function, at least for the Pentium 4.

That may be a possibility for a few chips.

> My
> combined mpn_add_add_n does 3 cycles per limb versus mpn_add_n's 4
> cycles per limb.

That sounds unlikely, unless you have a radically different approach
than the current add_n code.  The latency of paddq+psrlq propagating
carry is 4 cycles for a start.

If you mean 4 instead of 3, then that certainly could be so, with a
loop of 4 mmx insns and just 2 on the dependent chain as now.

> And on another note, where did the derivation come from in the Other
> Multiplications section of the manual where the Toom-Cook algorithm
> evaluated at points -r,...,0,...r requiring the divisors of algorithm
> T step 7 in Knuth be j^2 and the multipliers be (2 j t - j^2) for T-8?

I figured them while making some code implementing a toom-N with those
points.  I'm pretty sure they're right, but I never made an actual
proof or anything.