hgcd1/2

Tue Sep 17 13:42:21 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  I've made a quick try deleting it from the single-limb loop. See patch
  below. Measurements are a bit noisy, but it looks like a slowdown when I
  time it. With hgcd2 time increasing from 1220 cycles to 1290 (this time
  measured on broadwell), which seems to be an increase of more than one
  cycle per iteration of this loop.

With which HGCD_DIV1_METHOD did you make these experiments?

For _METHOD 1 one almost surely wants q = 1 special handling, at least
for Intel CPUs.  (Not as surely with AMD or ARM.)
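
To illustrate what q = 1 special handling means here, a minimal sketch
follows.  It is not the actual GMP code; the function name
div1_q1_special and the use of uint64_t for a limb are assumptions made
for the example.  The idea is that a single subtract and compare catches
the quotient-1 case (by far the most frequent quotient in Euclid's
algorithm) before paying for a hardware division.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from GMP.  Divides n by d, assuming
   n >= d > 0.  Returns the quotient and stores the remainder in *r. */
static inline uint64_t
div1_q1_special (uint64_t *r, uint64_t n, uint64_t d)
{
  uint64_t t = n - d;   /* cannot underflow since n >= d */

  if (t < d)
    {
      /* q = 1: n - d is already the remainder, no division needed. */
      *r = t;
      return 1;
    }

  /* General case: fall back to the hardware division instruction. */
  *r = n % d;
  return n / d;
}
```

Whether the extra branch pays off depends on how well it predicts and on
how slow the division instruction is, which is presumably why the answer
differs between CPU families.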

Incidentally, my mpn_div_11 asm code didn't help on any x86-64 CPU.  The
speed was about the same.  Presumably inlining of the C variants
compensates for their slower per-bit speed.
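
For readers unfamiliar with what "per-bit speed" refers to, here is a
generic shift-and-subtract division of the kind a C variant might use.
It is only a sketch under assumptions: the name div_bitwise is
hypothetical, a limb is taken to be uint64_t, and __builtin_clzll is a
GCC/Clang builtin rather than anything from GMP.  The loop runs once per
quotient bit, so its cost grows with the size of the quotient rather
than being a fixed instruction latency.

```c
#include <stdint.h>

/* Hypothetical bit-serial division, not taken from GMP.  Assumes
   n >= d > 0.  Returns the quotient and stores the remainder in *r. */
static inline uint64_t
div_bitwise (uint64_t *r, uint64_t n, uint64_t d)
{
  /* Number of positions d must be shifted left so that its top bit
     lines up with the top bit of n.  Non-negative since n >= d. */
  int shift = __builtin_clzll (d) - __builtin_clzll (n);
  uint64_t q = 0;

  d <<= shift;
  for (int i = 0; i <= shift; i++)
    {
      /* Develop one quotient bit per iteration. */
      q <<= 1;
      if (n >= d)
        {
          n -= d;
          q |= 1;
        }
      d >>= 1;
    }

  *r = n;
  return q;
}
```

Since small quotients dominate in Euclid's algorithm, the loop usually
runs only a few iterations, and when inlined the compiler can schedule
it together with the surrounding code.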

I find it hard to accept that 25 cycles per iteration is as good as it
gets.  (25 cycles is Intel's best division instruction speed.)  I still
believe we could beat it soundly with a table-based approach if it only
rarely incurs a branch miss.

-- 
Torbjörn
Please encrypt, key id 0xC8601622