user42 at zip.com.au
Sun Feb 15 08:29:26 CET 2004
Josh Liu <zliu2 at student.gsu.edu> writes:
> It is advantageous to not repeat that addition that was mentioned in
> the manual. I have timed both mpn_kara_mul_n and my 5 half operand
> size addition method and it seems that the function call overhead is
> not much of a big deal when compared with the speedup possible by
> eliminating 1/6 of the additions at the tail end of the
The remark in the manual about this was based on experience. We
measured code with the duplicate additions avoided, and it wasn't an
improvement. I forget exactly, but I have an idea it might have been
just slightly slower.
Bear in mind addition can be as little as 2 cycles/limb, so when
getting into karatsuba at as few as 20 or 30 limbs you're looking at
only a saving of 20 or 30 cycles less an extra function call (or is it
2 calls?). Either way it's not much compared to the real work of the
> Also, it might be possible to combine two addition
> operations into a single function, at least for the Pentium 4.
That may be a possibility for a few chips.
> combined mpn_add_add_n does 3 cycles per limb versus mpn_add_n's 4
> cycles per limb.
That sounds unlikely, unless you have a radically different approach
than the current add_n code. The latency of paddq+psrlq propagating
carry is 4 cycles for a start.
If you mean 4 instead of 3, then that certainly could be so, with a
loop of 4 mmx insns and just 2 on the dependent chain as now.
> And on another note, where did the derivation come from in the Other
> Multiplications section of the manual where the Toom-Cook algorithm
> evaluated at points -r,...,0,...r requiring the divisors of algorithm
> T step 7 in Knuth be j^2 and the multipliers be (2 j t - j^2) for T-8?
I figured them while making some code implementing a toom-N with those
points. I'm pretty sure they're right, but I never made an actual
proof or anything.
More information about the gmp-devel