[PATCH] Optimize 32-bit sparc T1 multiply routines.

Torbjorn Granlund tg at gmplib.org
Sun Jan 6 13:48:42 CET 2013

David Miller <davem at davemloft.net> writes:

  From: Torbjorn Granlund <tg at gmplib.org>
  Date: Fri, 04 Jan 2013 14:54:15 +0100
  > 1:      umulxhi %g5, %g5, %g1
  >         mulx    %g5, %g5, %g2
  >         umulxhi %g5, %g5, %g3
  >         mulx    %g5, %g5, %g4
  >         umulxhi %g5, %g5, %i1
  >         mulx    %g5, %g5, %i2
  >         umulxhi %g5, %g5, %i3
  >         mulx    %g5, %g5, %i4
  >         brnz,a  %i0, 1b
  >         addx    %i0, -1, %i0
  >         ret
  This runs in 9 seconds with a suitably initialized %i0
  for this processor (2.8 GHz).
Not bad.  It seems safe to assume the pipeline can sustain one integer
mul per cycle.

I recommend 4-way unrolling.

The summation method of mpn/powerpc64/mode64/aorsmul_1.asm might be
best.  Unfortunately, its submul_1 method will not work.  I cannot see
how we can avoid an extra instruction for every limb.
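For reference, here is a hedged portable-C sketch of the job addmul_1 must do (rp[] += up[] * v, returning the carry limb). It assumes 64-bit limbs and GCC/Clang's unsigned __int128; the names are illustrative, not GMP's internals.

```c
#include <stdint.h>

typedef uint64_t limb_t;

/* Sketch: rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb.
   The 64x64->128 product stands in for mulx/umulxhi (or mulld/mulhdu). */
static limb_t addmul_1(limb_t *rp, const limb_t *up, long n, limb_t v)
{
    limb_t cy = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        lo += cy;
        hi += lo < cy;          /* carry out of folding in the previous limb's carry */
        lo += rp[i];
        hi += lo < rp[i];       /* carry out of accumulating into the destination */
        rp[i] = lo;
        cy = hi;                /* never wraps: rp[i] + cy + up[i]*v fits in 128 bits */
    }
    return cy;
}
```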

This is the PPC code:

L(top):	mulld	r0, r9, r6
	mulhdu	r5, r9, r6
	mulld	r7, r27, r6
	mulhdu	r8, r27, r6
	ld	r9, 0(up)
	ld	r28, 0(rp)
	ld	r27, 8(up)
	ld	r29, 8(rp)
	adde	r0, r0, r12
	adde	r7, r7, r5
	mulld	r5, r9, r6
	mulhdu	r10, r9, r6
	mulld	r11, r27, r6
	mulhdu	r12, r27, r6
	ld	r9, 16(up)
	ld	r30, 16(rp)
	ld	r27, 24(up)
	ld	r31, 24(rp)
	adde	r5, r5, r8
	adde	r11, r11, r10
	addze	r12, r12
	ADDSUB	r0, r0, r28
	std	r0, 0(rp)
	ADDSUBC	r7, r7, r29
	std	r7, 8(rp)
	ADDSUBC	r5, r5, r30
	std	r5, 16(rp)
	ADDSUBC	r11, r11, r31
	std	r11, 24(rp)
	addi	up, up, 32
	addi	rp, rp, 32
	bdnz	L(top)
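A hedged C model (my naming, not GMP's) of the summation structure above, shown for the add case and without the 4-way unrolling: one carry chain (the adde insns) folds each product's low word with the previous product's high word, while a second chain (ADDSUB/ADDSUBC) accumulates the results into rp. Assumes 64-bit limbs and unsigned __int128.

```c
#include <stdint.h>

typedef uint64_t limb_t;

/* Two-carry-chain addmul_1: chain 1 sums lo[i] + hi[i-1], chain 2 adds
   the result into rp[].  The PPC loop interleaves these in 4-limb blocks. */
static limb_t addmul_1_two_chains(limb_t *rp, const limb_t *up, long n, limb_t v)
{
    limb_t c1 = 0, c2 = 0, prev_hi = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);

        /* chain 1 (adde): w = lo + prev_hi + c1 */
        limb_t w = lo + prev_hi;
        limb_t a = w < lo;
        w += c1;
        a += w < c1;
        c1 = a;

        /* chain 2 (ADDSUB/ADDSUBC): rp[i] = w + rp[i] + c2 */
        limb_t t = w + rp[i];
        limb_t b = t < w;
        t += c2;
        b += t < c2;
        rp[i] = t;
        c2 = b;

        prev_hi = hi;
    }
    return prev_hi + c1 + c2;   /* cannot wrap: the true carry-out fits a limb */
}
```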

I suspect one should schedule the ld insns sooner, into the mul block.
That might require a few more temp registers, which is no cost given the
registers available.  The PPC insns map to SPARC as follows:

PPC     SPARC
mulld   mulx
mulhdu  umulxhi
adde    addxccc
addze   addxccc %g0
ADDSUB  addxc
ADDSUBC addxccc  (in the PPC code, ADDSUBC expands to adde when built as addmul_1)

It will be 33 insns, which could take 4.25 cycles/limb (17 cycles per
4-limb iteration, assuming two insns issue per cycle).  Without
addmul_2, addmul_1 will be the most critical function, and should get
lots of attention.

With addmul_2, one saves loads and stores.  With addmul_(k) for large k
we will approach just two mul insns and two addxccc per cross multiply,
which I assume would take 2 cycles.  An addmul_2 should take around
3+epsilon cycles for this CPU, an addmul_3 should take 2.67+epsilon and
an addmul_4 should take 2.5+epsilon.
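These figures follow the pattern 2 + 2/k cycles/limb: the 2-cycle cross-product core plus roughly 2 cycles of per-limb load/store work amortized over the k accumulation passes. This decomposition (and the dual-issue assumption behind it) is my reading, not something stated above.

```c
/* Assumed model: 2 cycles per cross product (two mul + two addxccc,
   dual-issued) plus ~2 cycles of per-limb load/store overhead shared
   among the k passes of an addmul_k. */
static double addmul_k_cycles_per_limb(int k)
{
    return 2.0 + 2.0 / (double)k;
}
```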

An addmul_4 is usually smaller than addmul_1, since the former typically
requires no unrolling.

This is a complex optimisation project, in particular with the odd mpmul
insn.  I suspect addmul_(k) for k up to 4 would be hard to beat in
mul_basecase, except when m=n.  But for large enough m (assuming m >= n
as mul_basecase does) one could use mpmul(n) (repeated if m >= 2n), then
mpmul(m mod n) (again possibly repeated), until we reach a threshold.
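The decomposition can be sketched as pure bookkeeping. Everything here is hypothetical: the plan_chunks helper, the untuned threshold, and the Euclid-style descent on the remainder; mpmul itself is the hardware insn and is only counted, not issued.

```c
/* Illustrative cutoff below which addmul_(k) code would take over;
   not a tuned value. */
#define MPMUL_THRESHOLD 4

/* Hypothetical sketch of the proposed mul_basecase strategy for m >= n:
   floor(m/n) applications of mpmul(n), after which the m mod n tail
   becomes the new smaller dimension, repeated until it drops below the
   threshold.  Records the size of each mpmul call in sizes[]. */
static int plan_chunks(long m, long n, long *sizes, int max)
{
    int cnt = 0;
    while (n >= MPMUL_THRESHOLD && cnt < max) {
        for (long k = m / n; k > 0 && cnt < max; k--)
            sizes[cnt++] = n;          /* one mpmul(n) per full chunk */
        long r = m % n;                /* remaining r x n subproblem */
        m = n;
        n = r;
    }
    return cnt;
}
```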


More information about the gmp-devel mailing list