Possible new T3-T5 mul_1

Torbjorn Granlund tg at gmplib.org
Tue Apr 2 20:24:21 CEST 2013


David Miller <davem at davemloft.net> writes:

  See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can.
  overhead 6.00 cycles, precision 10000000 units of 3.51e-10 secs, CPU freq 2847.41 MHz

Darn.  Is the load latency > 3 cycles?

The old code had a load-use schedule of 8 cycles, the new code 3.
Both variants schedule mulx/umulxhi 4 cycles from dependees.

Could you please run these two code snippets and time them?

	.global	main

main:	sethi	%hi(2800000000), %g5
1:	mulx	%g1, %g1, %g1
	mulx	%g1, %g1, %g1
	mulx	%g1, %g1, %g1
	mulx	%g1, %g1, %g1
	brnz	%g5, 1b
	 dec	%g5
	retl
	 nop

	.global	main

main:	sethi	%hi(2800000000), %g5
1:	mulx	%g1, %g1, %g1
	add	%g1, %g1, %g1
	mulx	%g1, %g1, %g1
	add	%g1, %g1, %g1
	mulx	%g1, %g1, %g1
	add	%g1, %g1, %g1
	mulx	%g1, %g1, %g1
	add	%g1, %g1, %g1
	brnz	%g5, 1b
	 dec	%g5
	retl
	 nop
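
A rough sketch of how the measured run times could be turned into cycle counts. It assumes the 2847.41 MHz clock from the output quoted above, and that the brnz/dec loop overhead hides behind the dependent chain. The loop count is 2800000000 because that value is a multiple of 1024, so sethi %hi() loads it exactly. The function names are made up for illustration and are not part of GMP.

```python
# Hypothetical helpers, not part of GMP: convert a wall-clock run time of one
# of the snippets into cycles per loop iteration and per chain element.

LOOP_COUNT = 2_800_000_000  # sethi %hi(2800000000): low 10 bits are zero, so exact
CPU_HZ = 2847.41e6          # assumption: clock frequency from the quoted output


def cycles_per_iteration(seconds, cpu_hz=CPU_HZ, loop_count=LOOP_COUNT):
    """Total cycles spent divided by the number of loop iterations."""
    return seconds * cpu_hz / loop_count


def cycles_per_chain_element(seconds, chain_len=4, cpu_hz=CPU_HZ,
                             loop_count=LOOP_COUNT):
    """Cycles per element of the dependent chain.

    chain_len=4 matches both snippets: four mulx in the first, four
    mulx+add pairs in the second. Assumes the loop overhead is hidden.
    """
    return cycles_per_iteration(seconds, cpu_hz, loop_count) / chain_len
```

Since the loop count is close to the clock frequency, the run time in seconds is roughly the number of cycles per iteration. Subtracting the per-element figure of the first snippet from that of the second would give an estimate of what each dependent add contributes on top of a mulx.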


-- 
Torbjörn

