AMD-64 optimizations, some (new) code

Ashod Nakashian saghmos at
Mon Sep 26 18:17:25 CEST 2005

Torbjorn Granlund wrote:
> Ashod Nakashian <saghmos at> writes:
>   Finally I've successfully ported popham.asm and mul_1.asm. mul_1
>   uses software prefetching and my tests show that the current code
>   is the fastest (~3 c/l, and as low as 2.3 c/l for about 700-750
>   limbs).
> I cannot reproduce 2.3 c/l for mpn_mul_1.
> The best value I get is 3.3 for n=900.
> The loop needs 68 decode cycles, or 2.83 decode cycles per limb.
> --
> Torbjörn

Well, I'd agree with you if the numbers weren't so consistent. In fact, 
they are so consistent that two runs over the same range give, at quite 
a lot of points, the very same results, down to the last digit after the 
decimal point. This includes the fastest point, which, by the way, is 
NOT 2.3 c/l (sorry), but 2.17! Not a typo. This value is reached at 920 
limbs.
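
To put the disagreement in numbers, here is a rough sketch of the 
arithmetic. The 68 decode cycles and the 2.83 figure come from the quoted 
message; the 24 limbs per loop iteration is NOT stated anywhere in this 
thread, it is only inferred from 68 / 2.83, so treat it as an assumption. 
If the decode figure really is a hard lower bound, a measurement below it 
needs some other explanation.

```python
# Decode-bound arithmetic for the mul_1 loop discussed above.
DECODE_CYCLES_PER_ITERATION = 68  # from the quoted message
LIMBS_PER_ITERATION = 24          # ASSUMPTION: inferred from 68 / 2.83, not stated in the thread


def cycles_per_limb(cycles: float, limbs: int) -> float:
    """Cycles per limb for a loop that handles `limbs` limbs in `cycles` cycles."""
    return cycles / limbs


def below_bound(measured: float, bound: float) -> bool:
    """True when a measured c/l value is lower than the given bound."""
    return measured < bound


decode_bound = cycles_per_limb(DECODE_CYCLES_PER_ITERATION, LIMBS_PER_ITERATION)

# The two measurements mentioned in this thread.
for label, measured in (("920 limbs, my run", 2.17), ("n=900, Torbjorn's run", 3.3)):
    verdict = "below" if below_bound(measured, decode_bound) else "at or above"
    print(f"{label}: {measured:.2f} c/l is {verdict} the decode bound "
          f"of {decode_bound:.2f} c/l")
```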

And after all, I expect 'speed' to report the FASTEST time recorded, so 
it really can't be a glitch, unless the result of the function is also 
wrong (for certain inputs and not for others, since the code works 
correctly in all my tests) or 'speed' has some accuracy problem.
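
What I mean by "the fastest time recorded" can be sketched as follows. 
This is only an illustration of the minimum-of-several-runs idea, not how 
'speed' is actually implemented; the name fastest_time and the injectable 
clock parameter are made up for the example. Interference from the rest 
of the system can only make a run slower, so the minimum is the least 
disturbed sample, but it is only as trustworthy as the clock behind it.

```python
import time
from typing import Callable


def fastest_time(fn: Callable[[], object], repeats: int = 5,
                 clock: Callable[[], float] = time.perf_counter) -> float:
    """Run fn `repeats` times and return the smallest elapsed time, in clock units."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = clock()
        fn()
        elapsed = clock() - start
        # Keep the minimum: noise adds time, it never removes any.
        best = min(best, elapsed)
    return best
```

For example, fastest_time(lambda: sum(range(100000)), repeats=10) times a 
stand-in workload ten times and returns the quickest run in seconds.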

So my only question is, could this be the difference between processor 
revisions? I mean, we have seen HUGE changes in the decode/dispatch 
speed of certain instructions from revision to revision, the latest of 
which was in the P4 Prescott.

My CPU is a Clawhammer, 0.13 micron process. I had more info, like the 
revision and stepping, but I can't find it now. I'll try to get some 
software to dump the info and I'll send it if you are interested.

Now, to sort this out, I have also attached 3 different runs of 'speed' 
with mpn_mul_1.1. 50-50k in 10 steps, 600-1200 in 10 steps (this is the 
range where the fastest timings are found) and 600-1200 in 25 steps 
(just to put the previous numbers in perspective and show the overall 
graph, which is identical in the 3 runs). I could have sent other runs, 
since the data is really very consistent, but I guess these will do. You 
can check the data and see the command-line parameters passed to 
'speed'. Hope it helps you.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_1_stats.tar.bz2
Type: application/octet-stream
Size: 35668 bytes
Desc: not available
Url :
