Possible new T3-T5 mul_1
tg at gmplib.org
Tue Apr 2 21:09:51 CEST 2013
David Miller <davem at davemloft.net> writes:
> The L1 cache load-use latency is 5 cycles, the L2 load-use latency is
> 19 cycles, and the L3 load-use latency is 49 cycles. But usually the
> OoO unit hides much of this.
This must be industry-leading L1 latency. :-)
> This runs in 47.234 seconds.
> And this runs in 51.158 seconds.
Sorry for my doubt, 12 cycles just sounded impossible for a fully
pipelined function unit. At least they wired the mul unit close to the
simple op unit; usually super-long latencies are more due to distance
than deep function unit pipelining. Not here.
So what's going on in the a and b code variants? I assume the total OoO
capacity was just not enough for a ld->mul->add 17 cycle chain scheduled
at just 3+4 cycles. With fully scheduled loads, the OoO requirement was
just 12 vs 4 cycles.
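
To make the arithmetic above concrete, here is a rough sketch of the slack the OoO unit has to cover in each case. It assumes the 17-cycle chain is the 5-cycle L1 load-use latency plus the 12-cycle multiply latency, and that "3+4" means the load sits 3 cycles before the mulx and the mulx 4 cycles before the add. Both readings are assumptions, and the function name is made up for illustration.

```python
# Latencies quoted in the thread, in cycles. Splitting the 17-cycle chain
# into 5 (L1 load-use) + 12 (mulx) is an assumption of this sketch.
L1_LOAD_USE = 5
MULX_LATENCY = 12


def ooo_deficit(chain_latency, scheduled_distance):
    """Cycles of latency left for the out-of-order unit to hide."""
    return max(0, chain_latency - scheduled_distance)


# Variants a/b as read here: load 3 cycles before mulx, mulx 4 cycles
# before the add, so ld->mul->add is scheduled across 3+4 cycles.
unscheduled = ooo_deficit(L1_LOAD_USE + MULX_LATENCY, 3 + 4)

# Loads fully scheduled: only the mulx->add part of the chain remains.
scheduled = ooo_deficit(MULX_LATENCY, 4)

print(unscheduled, scheduled)
```

Under these assumptions the deficit per chain drops from 10 to 8 cycles once the loads are scheduled far enough ahead, which might be just enough to fit within the OoO window.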
I would assume that one could either choose to schedule loads >= 5
cycles from mulx, or schedule the addxccc one or two cycles further away