hgcd1/2

Thu Sep 5 08:09:25 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  I had a quick look at the machines that completed tuning this night.
  These three seem to prefer the old code (HGCD2_METHOD == 2):

  armcortexa7neon-unknown-linux-gnueabihf  pi2.gmplib.org-stat
  armcortexa8neon-unknown-linux-gnueabihf  beagle.gmplib.org-stat
  armcortexa12neon-unknown-linux-gnueabihf  tinker.gmplib.org-stat

  The others want HGCD_METHOD == 1, i.e., replacing div1 with plain
  division. 

The table generator program has now run.  Also pi1 wants the old code.

The next question is how badly they want it.  :-)

I checked the small quotient division speed of tinker and pi2.  It is 16
and 36 cycles, respectively.  With 36 cycles, I am not surprised the
division instructions should be avoided.  But why does tinker prefer
div1 before a 16 cycles division instruction?

I suppose it has a lot to do with pipeline length too.  Only bwl ran
tonight (of the known slow x86 dividers).  It prefers its 25 cycle
division instruction.  But it has a super long pipeline, and random
branches on average will cost (pipeline length)/2.  The low-end ARM
chips like cortex a12 (used by tinker) have short pipelines.

Did you have a chance of playing with some code which takes care of
small quotients without branches or division?

I feel pretty confident that with 25 cycles to play (as with Intel CPUs)
some bit operations for quotients <= 7 would give great speeup.  Such
code is perhaps best written in assembly, but even C code should help!

#if HAVE_NATIVE_div_1_lt8
...
#elif HAVE_NATIVE_div_1_lt4
...
#endif

-- 
Torbjörn
Please encrypt, key id 0xC8601622