GMP on Pentium 2
    Torbjorn Granlund 
    tg at swox.com
       
    Sat Nov  8 01:39:19 CET 2003
    
    
  
My measurements on ior.swox.se (pentium3), hill.swox.se (pentium2)
and gustaf.swox.se (pentiumpro) suggest that mpn_add_n from
mpn/x86/aors_n.asm takes somewhat under 3.2 cycles/limb, not the
slower 3.7 cycles/limb claimed.  Both tune/speed and
tests/devel/add_n.c agree on this.  But mpn/x86/k7/aors_n.asm is
still considerably faster at about 2.75 cycles/limb.

I wonder why you are seeing 3.7 cycles/limb for the default p6
code.  On what machine did you measure that, Kevin?

Things become even more odd when measuring mpn_sub_n from the
same files.  Here, I can reproduce your 3.7 cycles/limb using the
mpn/x86/aors_n.asm version.  Sanity prevails with the k7 code, as
it measures in at 2.75 cycles/limb, like the k7 mpn_add_n code.

Kevin Ryde <user42 at zip.com.au> writes:
  It looks like the carry flag is not separately renamed on p6, so using
  decl serializes or something, costing at least 4 cycles.  I knew this
  problem existed on p4 but wasn't aware of it on p6.

What worries me on p4 is not the lack of separate carry flag
renaming, but the 8 cycle latency of adc and sbb.  :-)

  Saving carry in a register and using subl [to restore it again]
  seems to help a mock-up loop.  Might have to go that way, or
  better scheduling, or lots more unrolling.  Followups to
  gmp-devel if anyone has good ideas (actually tested ideas, not
  random thoughts please :-).

How about sharing some code between k7 and p6?  We just found the
k7 code to be way superior; do we now need to invent something
completely different for p6?
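
For reference, here is a rough portable C sketch of what the two
routines being timed compute, with the carry (or borrow) kept in an
ordinary variable between limbs rather than in the flags register,
which is loosely the "save carry in a register" idea Kevin mentions.
This is not the code from mpn/x86/aors_n.asm or the k7 file; the
names limb_t, ref_add_n and ref_sub_n are made up for illustration,
and a 32-bit limb is assumed as on x86.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limb assumed */

/* rp[] = ap[] + bp[] over n limbs, least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
limb_t ref_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t cy = 0;                 /* carry lives in a plain variable */
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t s = a + bp[i];      /* may wrap around */
        limb_t c1 = s < a;         /* carry from a + b */
        limb_t r = s + cy;         /* add the incoming carry */
        limb_t c2 = r < s;         /* carry from adding cy */
        rp[i] = r;
        cy = c1 | c2;              /* at most one of them is set */
    }
    return cy;
}

/* rp[] = ap[] - bp[] over n limbs.  Returns the borrow out (0 or 1). */
limb_t ref_sub_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t bw = 0;                 /* borrow lives in a plain variable */
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t b = bp[i];
        limb_t d = a - b;          /* may wrap around */
        limb_t b1 = a < b;         /* borrow from a - b */
        limb_t r = d - bw;         /* subtract the incoming borrow */
        limb_t b2 = d < bw;        /* borrow from subtracting bw */
        rp[i] = r;
        bw = b1 | b2;
    }
    return bw;
}
```

The loop-carried dependency through cy/bw is the part the assembly
versions implement with adc/sbb, so this sketch is only useful as a
correctness reference, not as a guide to the cycle counts above.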
-- 
Torbjörn