hgcd1/2
Torbjörn Granlund
tg at gmplib.org
Thu Sep 5 08:09:25 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
I had a quick look at the machines that completed tuning this night.
These three seem to prefer the old code (HGCD2_METHOD == 2):
armcortexa7neon-unknown-linux-gnueabihf pi2.gmplib.org-stat
armcortexa8neon-unknown-linux-gnueabihf beagle.gmplib.org-stat
armcortexa12neon-unknown-linux-gnueabihf tinker.gmplib.org-stat
The others want HGCD_METHOD == 1, i.e., replacing div1 with plain
division.
The table generator program has now run. Also pi1 wants the old code.
The next question is how badly they want it. :-)
I checked the small quotient division speed of tinker and pi2. It is 16
and 36 cycles, respectively. With 36 cycles, I am not surprised the
division instructions should be avoided. But why does tinker prefer
div1 before a 16 cycles division instruction?
I suppose it has a lot to do with pipeline length too. Only bwl ran
tonight (of the known slow x86 dividers). It prefers its 25 cycle
division instruction. But it has a super long pipeline, and random
branches on average will cost (pipeline length)/2. The low-end ARM
chips like cortex a12 (used by tinker) have short pipelines.
Did you have a chance of playing with some code which takes care of
small quotients without branches or division?
I feel pretty confident that with 25 cycles to play (as with Intel CPUs)
some bit operations for quotients <= 7 would give great speeup. Such
code is perhaps best written in assembly, but even C code should help!
#if HAVE_NATIVE_div_1_lt8
...
#elif HAVE_NATIVE_div_1_lt4
...
#endif
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list