hgcd1/2
Torbjörn Granlund
tg at gmplib.org
Tue Sep 10 19:48:02 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
Maybe I'm confused. And I think tradeoffs are different for the
double-limb and the single-limb loops.
Or I am confused.
I have mostly thought of div1, less of div2.
For the single-limb loop, I think special handling of q == 1 means we
avoid a division and three independent single-limb multiplies, at the
cost of a mispredicted branch, plus some extra unconditional add/compere
operations. I would expect that to be a good tradeoff only on machines
with quite poor multiplication throughput.
I agree; even 64-bit ARM processors with very slow 64b x 64b and
moderate branch misprediction penalty probably are better off with the
branch reduced code. But what was the saying, "benchmarking is better
that blind faith"? :-D
For the double-limb loop, special handling of q == 1 also avoids the
operations needed to compute the double-limb remainder, which is a
higher cost and with deeper dependency path.
I need to look deeper into this!
> Or perhaps you reason along the lines of "if a CPU is good at small
> quotient division, it ought to be good at multiplication, therefore
> _METHOD 1 does not need q = 1 code which principally avoids some limb
> multiplications"?
I think that's a reasonable argument for the single-limb loop.
Only it is false. ARM A53, A57, A72, A73, A75 are counter examples.
I think multiply performance (at least its throughput) and small
quotient friendly division have zero correlation in current CPUs. All
four combinations (slow+slow, slow+fast, fast+slow, and fast+fast)
exist.
> Perhaps there is some confusion about which method is chosen for each
> HGCD2_METHOD value? You designated the original method number 2, your
> plain / based one number 1. I added lead texts to
> https://gmplib.org/devel/thres/HGCD2_METHOD.html for clarity.
And I was hoping the original method (#2) could be retired. At the
moment, that web page indicates that it's useful only on ia64 aka itanic
(14% margin to method 3). How much do we care about itanic performance?
Is it worth writing div_11 in itanic asm just to eliminate METHOD 2?
The result comes from an old version of FreeBSD with their
intentionally obsolete GNU tools and thus compiler support libraries.
I tried replacing the libgcc function __udivdi3 with code from gcc 9,
and suddenly HGCD2_METHOD becomes 3.
Conclusion: The HGCD2_METHOD 2 code in its current form is not useful on
any of our platforms. (Perhaps we should wait a few days more as not
all hosts have reported results yet.)
> I suggest that we rename HGCD2_METHOD to DIV_11_METHOD. Agree?
Makes sense. But I think it should still be tuned based on hgcd2
performance. So for speed speed and tuneup we'll get a matrix of
hgcd2_${DIV_11_METHOD}_${DIV_22_METHOD} (for tuneup, DIV_22_METHOD
likely depends on DIV_11_METHOD, but not the other way around, so we
don't have to try all combinations).
OK.
What about hgcd2_jacobi? Same, same, but different, I presume.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list