"Core 2" GMPbench results
Torbjorn Granlund
tege at swox.com
Tue Aug 22 21:00:01 CEST 2006
Décio Luiz Gazzoni Filho <decio at decpp.net> writes:
According to Agner Fog's table of instruction latencies,
http://www.agner.org/optimize/instruction_tables.pdf
there isn't anything out of the ordinary in these loops. Pentium
4 had terrible timings for adc, but this was remedied in Core 2.
The adcq insn is used for mpn_mul_1 which runs at 4.5 c/l.
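For readers who don't have the loop in front of them, here is a rough C sketch of what mpn_mul_1 computes: multiply an n-limb number by a single limb and return the carry-out limb. It is not GMP's code. The name ref_mul_1 and the 32-bit limb_t are assumptions made so the sketch stays in portable C; the real x86-64 routine is hand-written assembly on 64-bit limbs, and folding the carry into the next limb is roughly where an add-with-carry would appear.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs so the double-width product fits in uint64_t. */
typedef uint32_t limb_t;

/* Hypothetical reference routine: rp[0..n-1] = up[0..n-1] * v,
   returning the most significant (carry-out) limb. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow: (2^32-1)^2 + (2^32-1) < 2^64. */
        uint64_t prod = (uint64_t)up[i] * v + carry;
        rp[i] = (limb_t)prod;          /* low half is the result limb */
        carry = (limb_t)(prod >> 32);  /* high half feeds the next limb */
    }
    return carry;
}
```

The c/l figure above counts cycles per iteration of a loop of this shape, one limb per iteration.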
What if GMP is compiled in 32-bit mode? There were quite a few
problems with Pentium 4 in 64-bit mode as well, which should have
been remedied in Core 2 also.
I haven't had a chance to test that yet.
I do remember, though, that AMD's botched job at a 64-bit
extension of IA-32 invalidated certain opcodes, including that
of inc. That'd be a terrible situation to be in, since the inc
instruction doesn't update the carry flag while an addq $1, %rcx
would, which would ruin the algorithm. Still, could you try
changing that instruction (even if it would purposefully insert
a bug in the code) just to test this theory?
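The flag behaviour in question can be modelled with a toy sketch in C. This is only an illustration: the cpu_t struct and the sim_inc/sim_add helpers are made-up names, and only the register value and the carry flag are modelled (the real inc does update the other arithmetic flags, just not CF).

```c
#include <stdint.h>

/* Hypothetical toy model: one register plus the carry flag. */
typedef struct {
    uint64_t reg;
    int cf;
} cpu_t;

/* inc: bumps the register and leaves CF exactly as it was. */
void sim_inc(cpu_t *c)
{
    c->reg += 1;
}

/* add: bumps the register and overwrites CF with its own carry-out. */
void sim_add(cpu_t *c, uint64_t imm)
{
    uint64_t r = c->reg + imm;
    c->cf = (r < c->reg);  /* unsigned wraparound means a carry occurred */
    c->reg = r;
}
```

If a carry from the previous limb is pending (cf == 1) when the loop counter is bumped, sim_inc keeps it alive for the next adc, while sim_add with a small counter value clears it.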
They took the one-byte inc and dec opcodes (reused as REX prefixes in
64-bit mode), but in the wonderful world of x86, there are other
encodings for these operations, 3 bytes I think (for plain regs, that is).
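To make the byte count concrete, here is a small C sketch that builds the register form of a 64-bit inc or dec: a REX.W prefix, the 0xFF opcode, and a ModRM byte with /0 for inc or /1 for dec. The function name encode_incdec64 is made up for this illustration, and reg is assumed to be the usual register number (0 = rax, 1 = rcx, ..., 15 = r15).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the encoding of inc/dec on a 64-bit
   register into out[] and returns its length in bytes. */
size_t encode_incdec64(uint8_t out[3], unsigned reg, int is_dec)
{
    /* REX.W selects 64-bit operand size; REX.B extends reg to r8..r15. */
    out[0] = (uint8_t)(0x48 | ((reg >> 3) & 1));
    /* Opcode 0xFF: the group that holds inc (/0) and dec (/1). */
    out[1] = 0xFF;
    /* ModRM: mod=11 (register direct), reg field = /0 or /1, rm = reg. */
    out[2] = (uint8_t)(0xC0 | ((is_dec ? 1u : 0u) << 3) | (reg & 7));
    return 3;
}
```

For instance, inc %rcx comes out as 48 FF C1, against the single byte 41 for inc %ecx in 32-bit mode.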
Finally, doing some profiling using Core 2's built-in performance
counters might reveal cache or similar problems, exceedingly
unlikely as they may be.
I am using tiny operands here.
I think we're hitting a pipeline glitch, causing a pipeline
replay. I'll try and ask Intel.
--
Torbjörn