GMP on Pentium 2

Sat Nov 8 01:39:19 CET 2003

My measurements on ior.swox.se (pentium3) and hill.swox.se
(pentium2) and gustaf.swox.se (pentiumpro) suggest that mpn_add_n
from mpn/x86/aors_n.asm takes somewhat under 3.2 cycles/limb, not
the slower 3.7 cycles/limb claimed.  Both tune/speed and
tests/devel/add_n.c agree on this.  But mpn/x86/k7/aors_n.asm is
still considerably faster at about 2.75 cycles/limb.

I wonder why you are seeing 3.7 cycles/limb for the default p6
code.  On what machine did you measure that, Kevin?

Things become even more odd when measuring mpn_sub_n from the
same files.  Here, I can reproduce your 3.7 cycles/limb using the
mpn/x86/aors_n.asm version.  Sanity prevails with the k7 code, as
it measures in at 2.75 cycles/limb, like the k7 mpn_add_n code.
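
For reference, here is a rough C sketch of what the two routines
compute, one limb per loop iteration (which is the unit behind the
cycles/limb figures).  This is not the GMP source: the names limb_t,
ref_add_n and ref_sub_n are made up for illustration, and 32-bit
limbs are assumed as on x86.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry-out (0 or 1). */
static limb_t ref_add_n(limb_t *rp, const limb_t *up, const limb_t *vp,
                        size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i], v = vp[i];
        limb_t s = u + v;          /* wraps modulo 2^32 */
        limb_t c1 = s < u;         /* carry out of u + v */
        limb_t r = s + cy;
        limb_t c2 = r < s;         /* carry out of adding the old carry */
        rp[i] = r;
        cy = c1 | c2;              /* at most one of the two can be set */
    }
    return cy;
}

/* rp[0..n-1] = up[0..n-1] - vp[0..n-1]; returns the borrow-out (0 or 1). */
static limb_t ref_sub_n(limb_t *rp, const limb_t *up, const limb_t *vp,
                        size_t n)
{
    limb_t bw = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i], v = vp[i];
        limb_t d = u - v;          /* wraps modulo 2^32 */
        limb_t b1 = u < v;         /* borrow out of u - v */
        limb_t r = d - bw;
        limb_t b2 = d < bw;        /* borrow out of subtracting the old borrow */
        rp[i] = r;
        bw = b1 | b2;
    }
    return bw;
}
```

The two loops have the same shape, with one carry or borrow bit
threaded from each limb to the next.  In the assembly versions that
bit lives in the carry flag and the per-limb work is an adc or sbb.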

Kevin Ryde <user42 at zip.com.au> writes:

  It looks like the carry flag is not separately renamed on p6, so using
  decl serializes or something, costing at least 4 cycles.  I knew this
  problem existed on p4 but wasn't aware of it on p6.

What worries me for p4 is not the lack of separate carry flag
renaming, but the 8 cycle latency for adc and sbb.  :-)

  Saving carry in a register and using subl [to restore it again]
  seems to help a mock-up loop.  Might have to go that way, or
  better scheduling, or lots more unrolling.  Followups to
  gmp-devel if anyone has good ideas (actually tested ideas, not
  random thoughts please :-).
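
To make the data flow of that suggestion concrete, here is a C model
of it.  It is only a sketch: the actual mock-up loop is not shown in
the message, the names adc_limb and add_n_saved_carry are invented,
and the unroll factor of 4 is an arbitrary assumption.  The carry is
kept in an ordinary variable between unrolled chunks, so the loop
control in between does not need the carry flag to survive.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* One limb of add-with-carry.  *cy is the carry in on entry and the
   carry out on return (0 or 1).  Models a single adc. */
static limb_t adc_limb(limb_t u, limb_t v, limb_t *cy)
{
    limb_t s = u + v;
    limb_t c1 = s < u;
    limb_t r = s + *cy;
    *cy = c1 | (r < s);
    return r;
}

/* Hypothetical model of a loop that saves the carry in a register.
   rp[] = up[] + vp[] over n limbs; returns the carry-out. */
static limb_t add_n_saved_carry(limb_t *rp, const limb_t *up,
                                const limb_t *vp, size_t n)
{
    limb_t saved = 0;      /* stands in for the register holding the carry */
    size_t i = 0;

    while (n - i >= 4) {
        limb_t cy = saved; /* "restore": the flag is rebuilt from the register */
        rp[i + 0] = adc_limb(up[i + 0], vp[i + 0], &cy);
        rp[i + 1] = adc_limb(up[i + 1], vp[i + 1], &cy);
        rp[i + 2] = adc_limb(up[i + 2], vp[i + 2], &cy);
        rp[i + 3] = adc_limb(up[i + 3], vp[i + 3], &cy);
        saved = cy;        /* "save": the flag is copied out to the register */
        i += 4;            /* loop control may now clobber the flags freely */
    }

    /* Leftover limbs when n is not a multiple of 4. */
    limb_t cy = saved;
    for (; i < n; i++)
        rp[i] = adc_limb(up[i], vp[i], &cy);
    return cy;
}
```

The model says nothing about timing.  Whether the extra save and
restore instructions pay for themselves on p6 is exactly what would
have to be measured.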

How about sharing some code between k7 and p6?
We just found the k7 code to be way superior; do we now
need to invent something completely different for p6?

-- 
Torbjörn