AMD-64 optimizations, some (new) code
Torbjorn Granlund
tege at swox.com
Mon Sep 26 12:09:03 CEST 2005
Ashod Nakashian <saghmos at xter.net> writes:
> Finally I've successfully ported popham.asm and mul_1.asm. mul_1
> uses software prefetching and my tests show that the current code
> is the fastest (~3 c/l, and as low as 2.3 c/l for about 700-750
> limbs).

I cannot reproduce 2.3 c/l for mpn_mul_1.
The best value I get is 3.3 c/l for n=900.
The loop needs 68 decode cycles, or 2.83 decode cycles per limb,
so decoding alone should keep it above 2.3 c/l.
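
The decode figure can be checked with a small sketch in C. The 68 decode
cycles are from the message above; the 24 limbs per loop iteration is only
an assumption, inferred from 68 / 2.83, since the unroll factor is not
stated. The function name decode_bound_cl is hypothetical and is not part
of GMP.

```c
#include <assert.h>

/* Decode-limited lower bound in cycles per limb (c/l): the decode cycles
   needed for one loop iteration, divided by the number of limbs that
   iteration processes. Hypothetical helper, not a GMP function. */
static double decode_bound_cl(unsigned decode_cycles, unsigned limbs_per_iter)
{
    assert(limbs_per_iter > 0);  /* avoid division by zero */
    return (double)decode_cycles / (double)limbs_per_iter;
}
```

Under that assumption, decode_bound_cl(68, 24) comes to roughly 2.83,
which is above the reported 2.3 c/l and below the measured 3.3 c/l.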
--
Torbjörn