hgcd1/2

Tue Sep 10 19:48:02 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  Maybe I'm confused. And I think tradeoffs are different for the
  double-limb and the single-limb loops.

Or I am confused.

I have mostly thought of div1, less of div2.

  For the single-limb loop, I think special handling of q == 1 means we
  avoid a division and three independent single-limb multiplies, at the
  cost of a mispredicted branch, plus some extra unconditional add/compere
  operations. I would expect that to be a good tradeoff only on machines
  with quite poor multiplication throughput.

I agree; even 64-bit ARM processors with very slow 64b x 64b and
moderate branch misprediction penalty probably are better off with the
branch reduced code.  But what was the saying, "benchmarking is better
that blind faith"?  :-D

  For the double-limb loop, special handling of q == 1 also avoids the
  operations needed to compute the double-limb remainder, which is a
  higher cost and with deeper dependency path.

I need to look deeper into this!

  > Or perhaps you reason along the lines of "if a CPU is good at small
  > quotient division, it ought to be good at multiplication, therefore
  > _METHOD 1 does not need q = 1 code which principally avoids some limb
  > multiplications"?

  I think that's a reasonable argument for the single-limb loop.

Only it is false. ARM A53, A57, A72, A73, A75 are counter examples.

I think multiply performance (at least its throughput) and small
quotient friendly division have zero correlation in current CPUs.  All
four combinations (slow+slow, slow+fast, fast+slow, and fast+fast)
exist.

  > Perhaps there is some confusion about which method is chosen for each
  > HGCD2_METHOD value?  You designated the original method number 2, your
  > plain / based one number 1.  I added lead texts to
  > https://gmplib.org/devel/thres/HGCD2_METHOD.html for clarity.

  And I was hoping the original method (#2) could be retired. At the
  moment, that web page indicates that it's useful only on ia64 aka itanic
  (14% margin to method 3). How much do we care about itanic performance?
  Is it worth writing div_11 in itanic asm just to eliminate METHOD 2?

The result comes from an old version of FreeBSD with their
intentionally obsolete GNU tools and thus compiler support libraries.
I tried replacing the libgcc function __udivdi3 with code from gcc 9,
and suddenly HGCD2_METHOD becomes 3.

Conclusion: The HGCD2_METHOD 2 code in its current form is not useful on
any of our platforms.  (Perhaps we should wait a few days more as not
all hosts have reported results yet.)

  > I suggest that we rename HGCD2_METHOD to DIV_11_METHOD.  Agree?

  Makes sense. But I think it should still be tuned based on hgcd2
  performance. So for speed speed and tuneup we'll get a matrix of
  hgcd2_${DIV_11_METHOD}_${DIV_22_METHOD} (for tuneup, DIV_22_METHOD
  likely depends on DIV_11_METHOD, but not the other way around, so we
  don't have to try all combinations).

OK.

What about hgcd2_jacobi?  Same, same, but different, I presume.

-- 
Torbjörn
Please encrypt, key id 0xC8601622