GMP on Pentium 2
Torbjorn Granlund
tg at swox.com
Sat Nov 8 01:39:19 CET 2003
My measurements on ior.swox.se (pentium3), hill.swox.se
(pentium2) and gustaf.swox.se (pentiumpro) suggest that mpn_add_n
from mpn/x86/aors_n.asm takes somewhat under 3.2 cycles/limb, not
the slower 3.7 cycles/limb claimed. Both tune/speed and
tests/devel/add_n.c agree on this. But mpn/x86/k7/aors_n.asm is
still considerably faster at about 2.75 cycles/limb.
I wonder why you are seeing 3.7 cycles/limb for the default p6
code. On what machine did you measure that, Kevin?
Things become even more odd when measuring mpn_sub_n from the
same files. Here, I can reproduce your 3.7 cycles/limb using the
mpn/x86/aors_n.asm version. Sanity prevails with the k7 code, as
it measures in at 2.75 cycles/limb, like the k7 mpn_add_n code.
Kevin Ryde <user42 at zip.com.au> writes:
> It looks like the carry flag is not separately renamed on p6, so using
> decl serializes or something, costing at least 4 cycles. I knew this
> problem existed on p4 but wasn't aware of it on p6.
The lack of separate carry flag renaming is not what worries me
on p4; the 8-cycle latency of adc and sbb is. :-)
> Saving carry in a register and using subl [to restore it again]
> seems to help a mock-up loop. Might have to go that way, or
> better scheduling, or lots more unrolling. Followups to
> gmp-devel if anyone has good ideas (actually tested ideas, not
> random thoughts please :-).
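For readers following along, here is a C-level model of the "carry in a register" idea. It is only a sketch of the concept, not the assembly in mpn/x86/aors_n.asm and not the mock-up loop mentioned above. The carry (or borrow) lives in an ordinary variable rather than in the CPU carry flag, so loop bookkeeping, presumably the job decl does in the asm loop, cannot disturb it. The names limb_t, model_add_n and model_sub_n are made up for illustration, and the 32-bit limb is an assumption matching 32-bit x86.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type: one 32-bit word, as on 32-bit x86. */
typedef uint32_t limb_t;

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry out (0 or 1).
   The carry is kept in a plain variable between limbs instead of in the
   flags register. */
limb_t model_add_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s  = up[i] + vp[i];   /* wraps modulo 2^32 */
        limb_t c1 = s < up[i];       /* carry out of up + vp */
        limb_t r  = s + carry;
        limb_t c2 = r < s;           /* carry out of adding the saved carry */
        rp[i] = r;
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}

/* rp[0..n-1] = up[0..n-1] - vp[0..n-1]; returns the borrow out (0 or 1). */
limb_t model_sub_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t d  = up[i] - vp[i];   /* wraps modulo 2^32 */
        limb_t b1 = up[i] < vp[i];   /* borrow out of up - vp */
        limb_t r  = d - borrow;
        limb_t b2 = d < borrow;      /* borrow out of subtracting the saved borrow */
        rp[i] = r;
        borrow = b1 | b2;
    }
    return borrow;
}
```

The price of this approach is a couple of extra operations per limb to recreate the carry, which is why it only pays off if the flag dependency really is what costs the cycles.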
How about sharing some code between k7 and p6?
We just found the k7 code to be way superior; do we now
need to invent something completely different for p6?
--
Torbjörn