Possible new T3-T5 mul_1
David Miller
davem at davemloft.net
Tue Apr 2 20:53:25 CEST 2013
From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 02 Apr 2013 20:24:21 +0200
> David Miller <davem at davemloft.net> writes:
>
>   See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can.
>   overhead 6.00 cycles, precision 10000000 units of 3.51e-10 secs, CPU freq 2847.41 MHz
>
> Darn. Is the load latency > 3 cycles?

The L1 cache load-use latency is 5 cycles, the L2 load-use latency is
19 cycles, and the L3 load-use latency is 49 cycles. But usually the
OoO unit hides much of this.

> Could you please run these two code snippets and time them?
>
>         .global main
>
> main:   sethi   %hi(2800000000), %g5
> 1:      mulx    %g1, %g1, %g1
>         mulx    %g1, %g1, %g1
>         mulx    %g1, %g1, %g1
>         mulx    %g1, %g1, %g1
>         brnz    %g5, 1b
>          dec    %g5
>         retl
>          nop

This runs in 47.234 seconds.

>
> main:   sethi   %hi(2800000000), %g5
> 1:      mulx    %g1, %g1, %g1
>         add     %g1, %g1, %g1
>         mulx    %g1, %g1, %g1
>         add     %g1, %g1, %g1
>         mulx    %g1, %g1, %g1
>         add     %g1, %g1, %g1
>         mulx    %g1, %g1, %g1
>         add     %g1, %g1, %g1
>         brnz    %g5, 1b
>          dec    %g5
>         retl
>          nop

And this runs in 51.158 seconds.
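
For reference, the two wall-clock times can be converted into approximate per-instruction latencies. The sketch below is only a back-of-the-envelope check, not part of the measurements. It assumes that both loops ran at the 2847.41 MHz clock reported in the quoted speed output, that the loop body executes about 2800000000 times, and that the loop overhead is hidden behind the dependent mulx chain. The helper names `hi` and `cycles_per_iteration` are made up for illustration.

```python
# Back-of-the-envelope conversion of the measured times into cycles.
# Assumptions (not stated in the thread): the clock is the 2847.41 MHz
# from the quoted speed output, and branch/dec overhead is hidden
# behind the dependent mulx chain.

CPU_HZ = 2847.41e6          # assumed clock, from the quoted speed output
ITERATIONS = 2_800_000_000  # loop count set up by sethi (checked below)
CHAIN_LENGTH = 4            # dependent mulx (or mulx+add pairs) per iteration


def hi(value):
    """Value left in the register by 'sethi %hi(value), reg':
    the low 10 bits are cleared, the upper 22 bits are kept."""
    return value & 0xFFFFFFFF & ~0x3FF


def cycles_per_iteration(seconds):
    """Convert a wall-clock time for the whole loop into cycles per iteration."""
    return seconds * CPU_HZ / ITERATIONS


mulx_only = cycles_per_iteration(47.234) / CHAIN_LENGTH
mulx_add = cycles_per_iteration(51.158) / CHAIN_LENGTH

print(f"sethi loads {hi(2800000000)} into %g5")
print(f"mulx chain:     {mulx_only:.2f} cycles per mulx")
print(f"mulx+add chain: {mulx_add:.2f} cycles per mulx+add pair")
```

Under those assumptions the first loop comes out at about 12 cycles per dependent mulx, and the second at about 13 cycles per mulx+add pair, i.e. each dependent add costs roughly one extra cycle.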