Small operands gcd improvements

Tue Jul 30 20:52:49 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  tg at gmplib.org (Torbjörn Granlund) writes:

  > The gcd_2 function of mpn/generic/gcd.c uses a loop with an randomly
  > taken branch, and therefore is slow.

  I think I at some point attempted a version with masking tricks, similar
  to the C gcd_1, but I didn't manage to get any useful speedup.

Here is one for x86-64.  It can use either shrd or as shrd is very slow
for recent CPUs from both AMD and Intel, a series of shrx/shlx.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: x64-gcd_22.asm
Type: application/octet-stream
Size: 1480 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190730/11dcb648/attachment.obj>
-------------- next part --------------

And here is a variant for arm32 v6t2/v7a:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm64-gcd_22.asm
Type: application/octet-stream
Size: 1049 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190730/11dcb648/attachment-0001.obj>
-------------- next part --------------

And here is a variant for arm64:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm32-gcd_22.asm
Type: application/octet-stream
Size: 1053 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190730/11dcb648/attachment-0002.obj>
-------------- next part --------------

And finally one for POWER9:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64p9-gcd_22.asm
Type: application/octet-stream
Size: 1263 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190730/11dcb648/attachment-0003.obj>
-------------- next part --------------

Surely much can be improved here.  The latter 3 files compute both U-V
and V-U, shift the result and then select the positive one.

The x86-64 is different as it uses conditional negation.

The name of the game here is critical path shortening.  The # of insn is
secondary.  That's particularly relevant for POWER9 which, like all
POWER processors, have a 2 cycle latency for any plain ALU operation.
There, we could add 50 extra instructions for a 2 cycle critical path
reduction (you cannot save 1 cycle, only 2, 4, 6, etc).

A common problem is that the low limb is the critical one as the right
shift of the intermediate values' low limb need expensive 2-limb
merging.

None of the files is completed, but reasonably tested.  The interface is
strange, as the exit operands are stored in gp[0] through gp[3].  That's
just done to benefit testing.  Perhaps gcd_22 should (conditionally)
invoke (or tail call) gcd_11, possibly it should let its caller invoke
gcd_11 to finish up an unfinished job.

  > What can we do?
  >
  > 1. Define mpn_gcd_11.

  What should its input requirements be? Should it require one argument,
  either argument, or both argument to be odd? If even inputs are allowed,
  do we also allow zero input?

  To me it makes some sense to require that common factors of two are
  taken care of at top-level, and at least rule out the case of two even
  inputs to gcd_11.

Agreed.

  > 2. Add a new entry point mpn_gcd_11 to existing assembly mpn_gcd_1.

  I hope we don't we have to do that all at once, but can the mpn_gcd_11
  entry point be optional (via HAVE_NATIVE_*)?

Yes, otherwise it'll never happen!

-- 
Torbjörn
Please encrypt, key id 0xC8601622