"Core 2" GMPbench results

Tue Aug 22 21:00:01 CEST 2006

Décio Luiz Gazzoni Filho <decio at decpp.net> writes:

  According to Agner Fog's table of instruction latencies,

  http://www.agner.org/optimize/instruction_tables.pdf

  there isn't anything out of the ordinary in these loops. Pentium
  4 had terrible timings for adc, but this was remedied in Core 2.

The adcq insn is used for mpn_mul_1 which runs at 4.5 c/l.

  What if GMP is compiled in 32-bit mode? There were quite a few
  problems with Pentium 4 in 64-bit mode as well, which should have
  been remedied in Core 2 also.

I haven't had a chance to test that yet.

  I do remember, though, that AMD's botched job at a 64-bit
  extension of IA-32 invalidated certain opcodes, including that
  of inc. That'd be a terrible situation to be in, since the inc
  instruction doesn't update the carry flag while an addq $1, %rcx
  would, which would ruin the algorithm. Still, could you try
  changing that instruction (even if it would purposefully insert
  a bug in the code) just to test this theory?

They took away the one-byte inc and dec encodings, but in the wonderful
world of x86, there are other encodings for these operations, 3 bytes
I think (for plain registers, that is).

  Finally, doing some profiling using Core 2's built-in performance
  counters might reveal cache or similar problems, exceedingly
  unlikely as they may be.

I am using tiny operands here.

I think we're hitting a pipeline glitch, causing a pipeline
replay.  I'll try and ask Intel.

-- 
Torbjörn