GMP on Pentium 2

Torbjorn Granlund tg at
Sat Nov 8 01:39:19 CET 2003

My measurements on (pentium3) and
(pentium2) and (pentiumpro) suggest that mpn_add_n
from mpn/x86/aors_n.asm takes somewhat under 3.2 cycles/limb, not
the slower 3.7 cycles/limb claimed.  Both tune/speed and
tests/devel/add_n.c agree on this.  But mpn/x86/k7/aors_n.asm is
still considerably faster, at about 2.75 cycles/limb.

I wonder why you are seeing 3.7 cycles/limb for the default p6
code.  On what machine did you measure that, Kevin?

Things become even more odd when measuring mpn_sub_n from the
same files.  Here, I can reproduce your 3.7 cycles/limb using the
mpn/x86/aors_n.asm version.  Sanity prevails with the k7 code, as
it measures in at 2.75 cycles/limb, like the k7 mpn_add_n code.
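
For anyone following along without the sources at hand, the
operation being timed is limb-by-limb addition or subtraction with
the carry or borrow passed from each limb to the next, and
"cycles/limb" is the time per iteration of that loop.  Below is a
minimal portable C sketch of the semantics only.  It is not the
code in aors_n.asm; the names ref_add_n, ref_sub_n and limb_t are
made up for illustration, and 32-bit limbs are assumed, as on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* rp[] = ap[] + bp[] over n limbs, least significant limb first.
   Returns the carry out of the most significant limb (0 or 1). */
limb_t ref_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 0;
    assert(n > 0);                   /* this sketch expects at least one limb */
    for (size_t i = 0; i < n; i++) {
        limb_t s  = ap[i] + bp[i];   /* may wrap around */
        limb_t c1 = s < ap[i];       /* 1 if the first add wrapped */
        limb_t r  = s + carry;       /* add the incoming carry */
        limb_t c2 = r < s;           /* 1 if the second add wrapped */
        rp[i] = r;
        carry = c1 | c2;             /* both cannot be set at once */
    }
    return carry;
}

/* rp[] = ap[] - bp[] over n limbs.  Returns the borrow out (0 or 1). */
limb_t ref_sub_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t borrow = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        limb_t d  = ap[i] - bp[i];   /* may wrap around */
        limb_t b1 = ap[i] < bp[i];   /* 1 if the first subtract borrowed */
        limb_t r  = d - borrow;      /* subtract the incoming borrow */
        limb_t b2 = d < borrow;      /* 1 if the second subtract borrowed */
        rp[i] = r;
        borrow = b1 | b2;
    }
    return borrow;
}
```

In the assembly versions each iteration is an adc or sbb, so the
carry out of one limb is an input to the next.  That serial
dependency is what the carry flag discussion below is about.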

Kevin Ryde <user42 at> writes:

  It looks like the carry flag is not separately renamed on p6, so using
  decl serializes or something, costing at least 4 cycles.  I knew this
  problem existed on p4 but wasn't aware of it on p6.

What worries me for p4 is not the lack of separate carry flag
renaming, but the 8 cycle latency of adc and sbb.  :-)

  Saving carry in a register and using subl [to restore it again]
  seems to help a mock-up loop.  Might have to go that way, or
  better scheduling, or lots more unrolling.  Followups to
  gmp-devel if anyone has good ideas (actually tested ideas, not
  random thoughts please :-).
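
Since no code was posted for the mock-up, here is one possible
reading of the "carry in a register" idea, modelled in C.  The
carry is turned into an ordinary 0/1 value at the end of each
unrolled group and fed back in at the start of the next, so the
loop bookkeeping between groups no longer has to preserve the
carry flag.  Everything here is an assumption for illustration:
the names add_limb and chunked_add_n, the group size of 4, and the
restriction to n being a multiple of 4.  A C compiler will not
necessarily emit the adc chain this stands for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* Add one limb pair plus carry-in; store the sum, return carry-out. */
static limb_t add_limb(limb_t *r, limb_t a, limb_t b, limb_t cin)
{
    limb_t s  = a + b;
    limb_t c1 = s < a;          /* first add wrapped */
    limb_t t  = s + cin;
    limb_t c2 = t < s;          /* second add wrapped */
    *r = t;
    return c1 | c2;
}

/* rp[] = ap[] + bp[], four limbs per loop iteration.
   Simplification: n must be a multiple of 4. */
limb_t chunked_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t saved_carry = 0;     /* stands in for the register copy */
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        limb_t c = saved_carry; /* "restore" at the top of the group */
        c = add_limb(&rp[i],     ap[i],     bp[i],     c);
        c = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], c);
        c = add_limb(&rp[i + 2], ap[i + 2], bp[i + 2], c);
        c = add_limb(&rp[i + 3], ap[i + 3], bp[i + 3], c);
        saved_carry = c;        /* "save" before the loop bookkeeping */
    }
    return saved_carry;
}
```

The cost of such a scheme is the extra save and restore per group,
which is why it only pays off with enough unrolling.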

How about sharing some code between k7 and p6?
We just found the k7 code to be way superior.  Do we now
need to invent something completely different for p6?

