AMD-64 optimizations, some (new) code
Ashod Nakashian
saghmos at xter.net
Mon Sep 26 18:17:25 CEST 2005
Torbjorn Granlund wrote:
> Ashod Nakashian <saghmos at xter.net> writes:
>
> > Finally I've successfully ported popham.asm and mul_1.asm. mul_1
> > uses software prefetching and my tests show that the current code
> > is the fastest (~3 c/l, and as low as 2.3 c/l for about 700-750
> > limbs).
>
> I cannot reproduce 2.3 c/l for mpn_mul_1.
> The best value I get is 3.3 for n=900.
> The loop needs 68 decode cycles, or 2.83 decode cycles per limb.
>
> --
> Torbjörn
>
>
Well, I'd agree with you if the numbers weren't so consistent. In fact,
they are so consistent that two runs over the same range give, at quite a
lot of points, the very same results, down to the last digit after the
decimal point. This includes the fastest point, which, by the way, is NOT
2.3 c/l (sorry), but 2.17! Not a typo. This value is reached at 920 limbs.
And after all, I expect 'speed' to report the FASTEST time recorded, so
it really can't be a glitch, unless the result of the function is also
wrong (for certain inputs and not for others, since the code works
correctly in all my tests) or 'speed' has some accuracy problem.
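To make the numbers being compared concrete, here is a minimal sketch in C of the arithmetic involved. It is not how 'speed' is implemented; the function names and the clock frequency parameter are hypothetical, and the figure of 24 limbs per loop pass is only inferred from the quoted 68 decode cycles and 2.83 decode cycles per limb.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest of `count` timing samples, in seconds. Reporting the minimum
   discards samples that were slowed down by outside disturbances. */
static double fastest_sample(const double *samples, size_t count)
{
    assert(count > 0);
    double best = samples[0];
    for (size_t i = 1; i < count; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Convert the time of one call into cycles per limb (c/l).
   clock_hz is an assumed, caller-supplied CPU frequency. */
static double cycles_per_limb(double seconds, double clock_hz, size_t n_limbs)
{
    assert(n_limbs > 0);
    return seconds * clock_hz / (double) n_limbs;
}

/* Per-limb cost of a loop that spends loop_cycles per pass and
   handles limbs_per_pass limbs in each pass. */
static double loop_cycles_per_limb(double loop_cycles, size_t limbs_per_pass)
{
    assert(limbs_per_pass > 0);
    return loop_cycles / (double) limbs_per_pass;
}
```

Under the assumption of 24 limbs per pass, loop_cycles_per_limb(68.0, 24) gives about 2.83, so a measured 2.17 c/l would sit below the decode bound quoted above. That gap is what makes the measurement worth double-checking.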
So my only question is, could this be the difference between processor
revisions? I mean, we have seen HUGE changes in the decode/dispatch speed
of certain instructions from revision to revision, the latest of which was
in the P4 Prescott.
My CPU is a Clawhammer, 0.13 micron process. I had more info, like the
revision and stepping, but I can't find it now. I'll try to get some
software to dump the info, and I'll send it if you are interested.
Now, to sort this out, I have also attached 3 different runs of 'speed'
with mpn_mul_1.1: 50-50k in 10 steps, 600-1200 in 10 steps (this is the
range where the fastest timings are found), and 600-1200 in 25 steps (just
to put the previous numbers in perspective and show the overall graph,
which is identical in the 3 runs). I could have sent other runs, since the
data is really very consistent, but I guess these will do. You can check
the data and see the command-line parameters passed to 'speed'. Hope it
helps you.
Regards,
Ash
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_1_stats.tar.bz2
Type: application/octet-stream
Size: 35668 bytes
Desc: not available
Url : http://gmplib.org/list-archives/gmp-discuss/attachments/20050926/cb5c1c0d/mul_1_stats.tar-0001.obj