core2 add_n_sub_n

Tue Jun 1 22:20:51 CEST 2010

David Harvey <dmharvey at cims.nyu.edu> writes:

  This runs at about 3.5 c/l measured via

  ./speed -p 100000 -C -D -s 100-700 -t 100 mpn_add_n_sub_n

  This should be better than mpn_add_n + mpn_sub_n, which would
  currently be about 4 c/l.

Plus, mpn_add_n_sub_n reads its source operands just once, giving better
behaviour when operands don't readily fit in cache.
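
To make the "reads source operands just once" point concrete, here is a
minimal portable C sketch of a combined add/subtract loop.  It is not
the core2 assembly under discussion; the name add_n_sub_n_ref, the
64-bit limb type and the separate carry/borrow outputs are assumptions
made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;        /* assumed 64-bit limb */

/* Hypothetical reference routine (not GMP's implementation).
   Computes sum = a + b and diff = a - b over n limbs, least
   significant limb first.  Each source limb is loaded once and
   then used for both results.  The final carry of the addition
   and the final borrow of the subtraction (each 0 or 1) are
   stored through carry_out and borrow_out.  */
static void
add_n_sub_n_ref (limb_t *sum, limb_t *diff,
                 const limb_t *a, const limb_t *b, size_t n,
                 limb_t *carry_out, limb_t *borrow_out)
{
  limb_t cy = 0, bw = 0;

  for (size_t i = 0; i < n; i++)
    {
      limb_t x = a[i], y = b[i];   /* the only reads of the sources */

      limb_t s = x + y;            /* add, then add incoming carry */
      limb_t c1 = s < x;
      limb_t s2 = s + cy;
      limb_t c2 = s2 < s;

      limb_t d = x - y;            /* subtract, then subtract borrow */
      limb_t b1 = x < y;
      limb_t d2 = d - bw;
      limb_t b2 = d < bw;

      sum[i] = s2;
      diff[i] = d2;
      cy = c1 | c2;                /* c1 and c2 are never both set */
      bw = b1 | b2;                /* b1 and b2 are never both set */
    }

  *carry_out = cy;
  *borrow_out = bw;
}
```

Calling mpn_add_n followed by mpn_sub_n would instead walk both source
operands twice, which is what hurts once they no longer fit in cache.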

  I do not know much about optimising for this chip. I'm wondering if
  anyone has any thoughts about what the maximum theoretical speed of
  mpn_add_n_sub_n should be on core2.

This is also a 3 insn/cycle chip, like Athlon/Phenom.

Plain add/sub and logops behave like on Athlon/Phenom, i.e. 3 per cycle
and latency 1 cycle.

But adc/sbb have a latency of 2 and a throughput of 1 per cycle.  This
adc/sbb latency will put a lot of stress on out-of-order resources, so
it might be good to help the chip by using some extra registers in order
to make it possible to schedule loads early.
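
A back-of-envelope reading of the figures above, offered as a sketch
rather than a measurement: each limb of mpn_add_n_sub_n needs one adc
and one sbb, so at 1 adc/sbb per cycle the loop cannot go below 2 c/l.
The add carry chain and the subtract borrow chain are independent of
each other, and each advances one limb per 2 cycles of latency, which
gives the same bound.  The helper below only evaluates that arithmetic;
its name and parameters are hypothetical, and it ignores loads, stores,
and any cost of saving and restoring the carry flag between the two
chains.

```c
/* Hypothetical helper: lower bound in cycles per limb for a loop
   that issues carry_ops_per_limb adc/sbb instructions per limb.

   carry_ops_per_limb  adc/sbb instructions needed per limb
   carry_ops_per_cycle adc/sbb throughput of the chip
   chain_latency       adc/sbb latency in cycles; the adc chain and
                       the sbb chain are independent, so they can
                       overlap and only one chain's latency counts.  */
static double
add_n_sub_n_bound (double carry_ops_per_limb,
                   double carry_ops_per_cycle,
                   double chain_latency)
{
  double by_throughput = carry_ops_per_limb / carry_ops_per_cycle;
  double by_latency = chain_latency;

  /* The loop can run no faster than the tighter of the two limits. */
  return by_throughput > by_latency ? by_throughput : by_latency;
}
```

With the numbers quoted above, add_n_sub_n_bound (2.0, 1.0, 2.0) gives
2 c/l, against the 3.5 c/l measured.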

Usually, a main problem with all P6 chips is that they have only 2
register file read ports (some newer ones have a third read port that
can be used only for certain purposes).  Sustaining good performance
requires that the logical registers being read have been written to
within about 7 cycles, so that the value is read off a feed forwarding
bus, not the register file.

If other causes keep us far from executing 3 insn/cycle, it can help
to issue dummy copying insns such as 'lea (rax), rax'; that way the
register will remain available on a feed forwarding bus.

This P6 quirk makes it hard to measure a loop's real-world performance,
and it can make measurements unstable.  This is because if the loop is
slowed down by a context switch or a single cache miss, all registers
will get committed to the register file, and then all register reads
will subsequently be made from the register file, thus leading to slower
execution.  But won't the loop recover?  No, the initial slower
front-end execution will cause the generated register values to be
committed and become unavailable on the feed forwarding buses, so
register read port contention remains an issue.

(This will not happen in all loops, only in loops that are designed to
get a high ipc rate.)

-- 
Torbjörn