Small operands gcd improvements
Torbjörn Granlund
tg at gmplib.org
Tue Aug 13 11:18:08 UTC 2019
"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
> This means we are currently working on the _1o1o variants.
Yep.  But the other entry points will be one or two extra insns.
> May I propose a small latency micro-optimisation for the two just-proposed
> x86_64 variants?  The idea is not to use the register %r10 at all, but to
> keep the value of v0 directly in %rax, so that it is already in place when
> the function returns.
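(For context, what these single-limb variants compute is roughly the
following, in C terms.  This is only a sketch with ad-hoc names, not the
actual GMP code; both operands are assumed odd and nonzero.  The value
returned is the final v0, which is why keeping v0 in %rax leaves the result
in the return register for free.)

#include <stdint.h>

/* Sketch only: one reduction step per iteration, operands odd and nonzero.  */
static uint64_t
gcd_odd_sketch (uint64_t u0, uint64_t v0)
{
  while (u0 != v0)
    {
      uint64_t d = u0 > v0 ? u0 - v0 : v0 - u0;   /* |u - v|               */
      v0 = u0 < v0 ? u0 : v0;                     /* v = min(u,v)          */
      u0 = d >> __builtin_ctzll (d);              /* shift out trailing 0s */
    }
  return v0;                                      /* gcd = final v0        */
}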
Your bd2 seems to cause no slowdowns and is shorter, so feel free to
commit.
Your core2 code is considerably faster for nhm and wsm, somewhat slower
for hwl, bwl, and sky, and makes no difference for the other CPUs which use
this code.
I tried another variant of the code, with 2x unrolling in order to
alternate the use of %rax and v0; this removes a mov insn from one code
path:
        FUNC_ENTRY(2)
        jmp     L(e)            C

        ALIGN(16)               C              K10 BD1 CNR NHM SBR
L(top): cmovc   %rax, u0        C u = |v - u|  0,3 0,3 0,6 0,5 0,5
        cmovc   %r9, v0         C v = min(u,v) 0,3 0,3 2,8 1,7 1,7
        bsf     %rax, %rcx      C              3   3   6   5   5
        shr     R8(%rcx), u0    C              1,7 1,6 2,8 2,8 2,8
L(e):   mov     v0, %rax        C              1   1   4   3   3
        sub     u0, v0          C v - u        2   2   5   4   4
        mov     u0, %r9         C              2   2   3   3   4
        sub     %rax, u0        C u - v        2   2   4   3   4
        jz      L(end)          C
        cmovc   v0, u0          C u = |v - u|  0,3 0,3 0,6 0,5 0,5
        cmovc   %r9, %rax       C v = min(u,v) 0,3 0,3 2,8 1,7 1,7
        bsf     v0, %rcx        C              3   3   6   5   5
        shr     R8(%rcx), u0    C              1,7 1,6 2,8 2,8 2,8
        mov     %rax, v0        C              1   1   4   3   3
        sub     u0, %rax        C v - u        2   2   5   4   4
        mov     u0, %r9         C              2   2   3   3   4
        sub     v0, u0          C u - v        2   2   4   3   4
        jnz     L(top)          C
L(e2):  mov     v0, %rax
L(end): FUNC_EXIT()
        ret
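(To map the asm back to C: each unrolled half performs one step of the
sketch shown earlier, and the two exits correspond to L(end) and L(e2).
Only the second exit needs the extra mov, since there the result is left in
v0 rather than in %rax.  Again just an illustrative sketch, not GMP code.)

#include <stdint.h>

/* Sketch of the 2x unrolled structure, ad-hoc names.  */
static uint64_t
gcd_odd_2x_sketch (uint64_t u0, uint64_t v0)
{
  for (;;)
    {
      if (u0 == v0)
        return v0;                                /* L(end): already in %rax  */
      uint64_t d = u0 > v0 ? u0 - v0 : v0 - u0;   /* the cmovc pair           */
      v0 = u0 < v0 ? u0 : v0;
      u0 = d >> __builtin_ctzll (d);              /* bsf + shr                */

      if (u0 == v0)
        return v0;                                /* L(e2): needs the mov     */
      d = u0 > v0 ? u0 - v0 : v0 - u0;
      v0 = u0 < v0 ? u0 : v0;
      u0 = d >> __builtin_ctzll (d);
    }
}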
Unfortunately, this code is not always an improvement either. It is
faster for cnr, pnr, bwl and sky. It is slower than your code for nhm
and wsm.
--
Torbjörn
Please encrypt, key id 0xC8601622