Possible new T3-T5 mul_1

David Miller davem at davemloft.net
Tue Apr 2 20:53:25 CEST 2013


From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 02 Apr 2013 20:24:21 +0200

> David Miller <davem at davemloft.net> writes:
> 
>   See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can.
>   overhead 6.00 cycles, precision 10000000 units of 3.51e-10 secs, CPU freq 2847.41 MHz
> 
> Darn.  Is the load latency > 3 cycles?

The L1 cache load-use latency is 5 cycles, the L2 load-use latency is
19 cycles, and the L3 load-use latency is 49 cycles.  The out-of-order
(OoO) unit usually hides much of this, though.

> Could you please run these two code snippets and time them?
> 
> 	.global	main
> 
> main:	sethi	%hi(2800000000), %g5
> 1:	mulx	%g1, %g1, %g1
> 	mulx	%g1, %g1, %g1
> 	mulx	%g1, %g1, %g1
> 	mulx	%g1, %g1, %g1
> 	brnz	%g5, 1b
> 	 dec	%g5
> 	retl
> 	 nop

This runs in 47.234 seconds.

> 
> main:	sethi	%hi(2800000000), %g5
> 1:	mulx	%g1, %g1, %g1
> 	add	%g1, %g1, %g1
> 	mulx	%g1, %g1, %g1
> 	add	%g1, %g1, %g1
> 	mulx	%g1, %g1, %g1
> 	add	%g1, %g1, %g1
> 	mulx	%g1, %g1, %g1
> 	add	%g1, %g1, %g1
> 	brnz	%g5, 1b
> 	 dec	%g5
> 	retl
> 	 nop

And this runs in 51.158 seconds.
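
For reference, a rough conversion of the two timings into cycles per
link of the dependency chain.  This is only a sketch: it assumes the
2847.41 MHz figure from the quoted timing output also applies to these
runs, that the loop executes about 2.8e9 times (the value loaded by
sethi), and that the branch and dec overhead is hidden behind the
dependent chain.  The names below are made up for illustration.

```python
# Assumed inputs, taken from the figures quoted in this message.
CPU_HZ = 2847.41e6          # "CPU freq 2847.41 MHz" from the timing output
ITERATIONS = 2_800_000_000  # loop count loaded by sethi %hi(2800000000)
CHAIN_LEN = 4               # dependent mulx (or mulx+add) links per iteration


def cycles_per_link(seconds):
    """Convert a wall-clock time into cycles per link of the chain."""
    return seconds * CPU_HZ / (ITERATIONS * CHAIN_LEN)


mulx_only = cycles_per_link(47.234)   # first snippet: mulx chain
mulx_add = cycles_per_link(51.158)    # second snippet: mulx+add chain

print(f"mulx only : {mulx_only:.2f} cycles per mulx")
print(f"mulx + add: {mulx_add:.2f} cycles per mulx+add pair")
print(f"difference: {mulx_add - mulx_only:.2f} cycles")
```

Under those assumptions this works out to roughly 12 cycles per
dependent mulx and roughly 13 cycles per mulx+add pair, i.e. the
interleaved add costs about one extra cycle on the chain.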

