Possible new T3-T5 mul_1

Tue Apr 2 21:09:51 CEST 2013

David Miller <davem at davemloft.net> writes:

  The L1 cache load-use latency is 5 cycles, the L2 load-use latency is
  19 cycles, and the L3 load-use latency is 49 cycles.  But usually the
  OoO unit hides much of this.

This must be industry-leading L1 latency.  :-)

  This runs in 47.234 seconds.

  And this runs in 51.158 seconds.

Sorry for my doubt; 12 cycles just sounded impossible for a fully
pipelined function unit.  At least they wired the mul unit close to the
simple op unit; usually super-long latencies are more due to distance
than deep function unit pipelining.  Not here.

So what's going on in the a and b code variants?  I assume the total OoO
capacity was just not enough for a ld->mul->add chain of 17 cycles (5 for
the load plus 12 for the mul) scheduled at just 3+4 cycles.  With fully
scheduled loads, the OoO requirement was just 12 vs 4 cycles.
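
To make the arithmetic explicit, here is a rough sketch of the latency the
OoO unit is left to hide in each case.  It assumes that "3+4" means the
load sits 3 cycles ahead of mulx and mulx sits 4 cycles ahead of the add;
the function and variable names are made up for illustration, and this is
not a model of the actual pipeline.

```python
# Latencies quoted in this thread, in cycles.  Names are illustrative only.
LOAD_USE_L1 = 5    # L1 load-use latency
MUL_LATENCY = 12   # mulx latency

def ooo_cycles_to_hide(latency, scheduled_distance):
    """Cycles of latency that static scheduling does not cover."""
    return max(0, latency - scheduled_distance)

# Assumed reading of "3+4": load 3 cycles before mulx, mulx 4 before the add.
short_loads = ooo_cycles_to_hide(LOAD_USE_L1 + MUL_LATENCY, 3 + 4)

# Loads fully scheduled: only the mul->add leg is left for the OoO unit.
full_loads = ooo_cycles_to_hide(MUL_LATENCY, 4)

print(short_loads, full_loads)
```

Under those assumptions the OoO unit has to cover 10 cycles per chain in
the first case and 8 in the second.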

I would assume that one could either choose to schedule loads >= 5
cycles from mulx, or schedule the addxccc one or two cycles further away
from mulx...

-- 
Torbjörn