[PATCH] Optimize 32-bit sparc T1 multiply routines.
Torbjorn Granlund
tg at gmplib.org
Sun Jan 6 13:08:25 CET 2013
David Miller <davem at davemloft.net> writes:
> Thanks for your help, the following works.  I'll work on unrolling
> and scheduling it.
>
> PROLOGUE(mpn_sub_nc)
> 	ba,pt	%xcc, L(ent)
> 	xor	cy, 1, cy
> EPILOGUE()
> PROLOGUE(mpn_sub_n)
> 	mov	1, cy
> L(ent):	cmp	%g0, cy
> L(top):	ldx	[up+0], %o4
> 	add	up, 8, up
> 	ldx	[vp+0], %o5
> 	add	vp, 8, vp
> 	add	rp, 8, rp
> 	add	n, -1, n
> 	xnor	%o5, %g0, %o5
> 	addxccc	%o4, %o5, %g3
> 	brgz	n, L(top)
> 	stx	%g3, [rp-8]
> 	clr	%o0
> 	retl
> 	movcc	%xcc, 1, %o0
> EPILOGUE()
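For readers less familiar with the trick in the quoted loop: the subtraction is done as u + ~v + cy, with the carry seeded to 1 (the xnor complements v, addxccc adds through the carry chain, and the final movcc returns the borrow).  A hedged C sketch of that logic, not the GMP API, might look like:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

/* Illustrative C model of the quoted mpn_sub_n loop: subtraction is
   computed as u + ~v + cy with cy seeded to 1, mirroring the
   xnor / addxccc carry chain.  Returns the final borrow (0 or 1),
   as the trailing movcc does. */
mp_limb_t sub_n_sketch(mp_limb_t *rp, const mp_limb_t *up,
                       const mp_limb_t *vp, size_t n)
{
    mp_limb_t cy = 1;                 /* no incoming borrow */
    for (size_t i = 0; i < n; i++) {
        mp_limb_t u = up[i], v = ~vp[i];
        mp_limb_t s = u + v;
        mp_limb_t c = s < u;          /* carry out of u + ~v */
        s += cy;
        c += s < cy;                  /* carry out of adding cy */
        rp[i] = s;
        cy = c;
    }
    return cy ^ 1;                    /* borrow = !carry */
}
```

The mpn_sub_nc entry point's "xor cy, 1, cy" fits the same model: an incoming borrow simply flips the seed.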
Since we are working with a throughput-constrained pipeline, we should
really use as few insns as possible.

There are 6 operation insns, and it seems hard to use fewer than 5
bookkeeping insns.  With k-way unrolling we should then get to
max(3, (6k+5)/(2k)) cycles/limb.
For small k, we could instead point each pointer at the end of its
operand, then use a combined index and loop counter running -n...0.
This would give max(3, (7k+1)/(2k)) cycles/limb.
(The max(3...) handles the load/store bandwidth limit. It has no
limiting effect for sub_n, but it does for add_n.)
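The combined index/counter idea (method 2) can be sketched in C: advance the pointers past the end of their operands, then let one register serve as both the limb offset and the loop counter, counting -n, ..., -1.  This is a hedged illustration of the loop structure only; the carry handling below is plain C, not the addxccc carry chain, and the function name is invented:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

/* "Method 2" loop structure: pointers pre-advanced to the end,
   a single negative index i doubles as offset and loop counter,
   so the loop needs only one bookkeeping update (i++) plus the
   terminating branch. */
mp_limb_t add_n_method2(mp_limb_t *rp, const mp_limb_t *up,
                        const mp_limb_t *vp, size_t n)
{
    mp_limb_t cy = 0;
    rp += n; up += n; vp += n;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++) {
        mp_limb_t u = up[i], v = vp[i];
        mp_limb_t s = u + v;
        mp_limb_t c = s < u;          /* carry out of u + v */
        s += cy;
        c += s < cy;                  /* carry out of adding cy */
        rp[i] = s;
        cy = c;
    }
    return cy;
}
```

The per-limb cost of the extra-insn tradeoff is what the (7k+1)/(2k) formula above accounts for.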
sub_n:

  k    method 1   method 2
  1      5.5        4.0
  2      4.2        3.8
  3      3.8        3.7
  4      3.6        3.6
  5      3.5        3.6
  6      3.4        3.6
  7      3.4        3.6
  8      3.3        3.6
 oo      3.0        3.5
add_n:

  k    method 1   method 2
  1      5.0        3.5
  2      3.8        3.2
  3      3.3        3.2
  4      3.1        3.1
  5      3.0        3.1
  6      3.0        3.1
  7      3.0        3.1
  8      3.0        3.1
 oo      3.0        3.0
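The tables follow directly from the formulas above.  As a sanity check, here is a small C model; note that only the sub_n operation count (6 insns) is stated in the text, and taking add_n to need 5 (no xnor for the complement) is an inference that reproduces its table:

```c
/* Predicted cycles/limb for k-way unrolling on the T1.
   Method 1: max(3, (ops*k + 5) / (2k)) -- 5 bookkeeping insns.
   Method 2: max(3, ((ops+1)*k + 1) / (2k)) -- one extra insn per
   limb for the combined index, but only 1 bookkeeping insn.
   The floor of 3 is the load/store bandwidth limit: 2 loads and
   1 store per limb on a pipeline issuing at most one memory op
   per cycle... per the max(3, ...) remark in the text. */
static double method1(int ops, int k)
{
    double c = (double)(ops * k + 5) / (2 * k);
    return c > 3.0 ? c : 3.0;
}

static double method2(int ops, int k)
{
    double c = (double)((ops + 1) * k + 1) / (2 * k);
    return c > 3.0 ? c : 3.0;
}
```

For example, method1(6, 1) gives the 5.5 c/l in the sub_n table, and method1(5, 8) bottoms out at the 3.0 load/store limit seen for add_n.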
For add_n, I recommend either method 1 with 4-way unrolling, or method 2
with 2-way unrolling.
For sub_n we should use at least 4-way unrolling.
--
Torbjörn