hgcd1/2
Niels Möller
nisse at lysator.liu.se
Tue Sep 10 15:33:19 UTC 2019
tg at gmplib.org (Torbjörn Granlund) writes:
> nisse at lysator.liu.se (Niels Möller) writes:
>
> > That should in particular be true if the q = 1 code is removed from
> > hgcd2, for machines with good multiplication throughput.
>
> Can we remove that for everything but HGCD2_METHOD == 2? It would be
> nice to not have an explosion of cases.
>
> Do you really mean HGCD2_METHOD 2? I'd say that only _METHOD 1 is poor
> at q = 1, while _METHOD 2 and 3 handle it well.
Maybe I'm confused. And I think tradeoffs are different for the
double-limb and the single-limb loops.
For the single-limb loop, I think special handling of q == 1 means we
avoid a division and three independent single-limb multiplies, at the
cost of a mispredicted branch, plus some extra unconditional add/compere
operations. I would expect that to be a good tradeoff only on machines
with quite poor multiplication throughput.
For the double-limb loop, special handling of q == 1 also avoids the
operations needed to compute the double-limb remainder, which is a
higher cost and with deeper dependency path.
> Or perhaps you reason along the lines of "if a CPU is good at small
> quotient division, it ought to be good at multiplication, therefore
> _METHOD 1 does not need q = 1 code which principally avoids some limb
> multiplications"?
I think that's a reasonable argument for the single-limb loop.
> Perhaps there is some confusion about which method is chosen for each
> HGCD2_METHOD value? You designated the original method number 2, your
> plain / based one number 1. I added lead texts to
> https://gmplib.org/devel/thres/HGCD2_METHOD.html for clarity.
And I was hoping the original method (#2) could be retired. At the
moment, that web page indicates that it's useful only on ia64 aka itanic
(14% margin to method 3). How much do we care about itanic performance?
Is it worth writing div_11 in itanic asm just to eliminate METHOD 2?
> We call the normalizing div2 only when q > bh, which implies bh^2 < ah.
> Then bh can't be normalized. When debugging (on 64-bit), I only saw this
> code entered with size of bh exactly 32 bits, ah exactly 64 bits. But I
> think it might be called with bh even smaller.
>
> A comment stating that cnt = 0 is not possible (or an ASSERT!).
Makes sense.
> I suggest that we rename HGCD2_METHOD to DIV_11_METHOD. Agree?
Makes sense. But I think it should still be tuned based on hgcd2
performance. So for speed speed and tuneup we'll get a matrix of
hgcd2_${DIV_11_METHOD}_${DIV_22_METHOD} (for tuneup, DIV_22_METHOD
likely depends on DIV_11_METHOD, but not the other way around, so we
don't have to try all combinations).
> Than we might want DIV_22_METHOD too.
We'll likely need at least one bitwise, like the current one, and one
based on the selected div1 applied to the high limbs.
> The thing is, we don't want q = 2 and one number reduced to a zero. We
> want to terminate with a matrix corresponding to both a and b roughly
> half a limb (full limb, if it weren't for the truncation when going from
> double precision to single precision), and |a - b| smaller.
>
> [snip]
>
> Pretty subtle!
That's the "half" in hgcd. Numbers shouldn't be reduced too far.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list