Small operands gcd improvements
Torbjörn Granlund
tg at gmplib.org
Tue Aug 13 11:18:08 UTC 2019
"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
> This means we are currently working on the _1o1o variants.
Yep.  But the other entry points will be one or two extra insns.
> May I propose a small latency micro-optimisation for the two just-proposed
> x86_64 variants?  The idea is not to use the register %r10 at all, but to
> keep the value of v0 directly in %rax, so that it is already in place when
> the function returns.
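(For context, what these single-limb variants compute is roughly the
following, in C terms.  This is only a sketch with ad-hoc names, not the
actual GMP code; both operands are assumed odd and nonzero.  The value
returned is the final v0, which is why keeping v0 in %rax leaves the result
in the return register for free.)

#include <stdint.h>

/* Sketch only: one reduction step per iteration, operands odd and nonzero.  */
static uint64_t
gcd_odd_sketch (uint64_t u0, uint64_t v0)
{
  while (u0 != v0)
    {
      uint64_t d = u0 > v0 ? u0 - v0 : v0 - u0;   /* |u - v|               */
      v0 = u0 < v0 ? u0 : v0;                     /* v = min(u,v)          */
      u0 = d >> __builtin_ctzll (d);              /* shift out trailing 0s */
    }
  return v0;                                      /* gcd = final v0        */
}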
Your bd2 seems to cause no slowdowns and is shorter, so feel free to
commit.
Your core2 code is considerably faster for nhm and wsm, somewhat slower
for hwl, bwl, and sky, and makes no difference for the other CPUs which use
this code.
I tried another variant of the code, with 2x unrolling in order to
alternate the use of %rax and v0; this removes a mov insn from one code
path:
        FUNC_ENTRY(2)
        jmp     L(e)            C

        ALIGN(16)               C              K10 BD1 CNR NHM SBR
L(top): cmovc   %rax, u0        C u = |v - u|  0,3 0,3 0,6 0,5 0,5
        cmovc   %r9, v0         C v = min(u,v) 0,3 0,3 2,8 1,7 1,7
        bsf     %rax, %rcx      C              3   3   6   5   5
        shr     R8(%rcx), u0    C              1,7 1,6 2,8 2,8 2,8
L(e):   mov     v0, %rax        C              1   1   4   3   3
        sub     u0, v0          C v - u        2   2   5   4   4
        mov     u0, %r9         C              2   2   3   3   4
        sub     %rax, u0        C u - v        2   2   4   3   4
        jz      L(end)          C
        cmovc   v0, u0          C u = |v - u|  0,3 0,3 0,6 0,5 0,5
        cmovc   %r9, %rax       C v = min(u,v) 0,3 0,3 2,8 1,7 1,7
        bsf     v0, %rcx        C              3   3   6   5   5
        shr     R8(%rcx), u0    C              1,7 1,6 2,8 2,8 2,8
        mov     %rax, v0        C              1   1   4   3   3
        sub     u0, %rax        C v - u        2   2   5   4   4
        mov     u0, %r9         C              2   2   3   3   4
        sub     v0, u0          C u - v        2   2   4   3   4
        jnz     L(top)          C
L(e2):  mov     v0, %rax
L(end): FUNC_EXIT()
        ret
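(To map the asm back to C: each unrolled half performs one step of the
sketch shown earlier, and the two exits correspond to L(end) and L(e2).
Only the second exit needs the extra mov, since there the result is left in
v0 rather than in %rax.  Again just an illustrative sketch, not GMP code.)

#include <stdint.h>

/* Sketch of the 2x unrolled structure, ad-hoc names.  */
static uint64_t
gcd_odd_2x_sketch (uint64_t u0, uint64_t v0)
{
  for (;;)
    {
      if (u0 == v0)
        return v0;                                /* L(end): already in %rax  */
      uint64_t d = u0 > v0 ? u0 - v0 : v0 - u0;   /* the cmovc pair           */
      v0 = u0 < v0 ? u0 : v0;
      u0 = d >> __builtin_ctzll (d);              /* bsf + shr                */

      if (u0 == v0)
        return v0;                                /* L(e2): needs the mov     */
      d = u0 > v0 ? u0 - v0 : v0 - u0;
      v0 = u0 < v0 ? u0 : v0;
      u0 = d >> __builtin_ctzll (d);
    }
}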
Unfortunately, this code is not always an improvement either. It is
faster for cnr, pnr, bwl and sky. It is slower than your code for nhm
and wsm.
--
Torbjörn
Please encrypt, key id 0xC8601622