hgcd1/2

Tue Sep 10 15:33:19 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   > That should in particular be true if the q = 1 code is removed from
>   > hgcd2, for machines with good multiplication throughput.
>
>   Can we remove that for everything but HGCD2_METHOD == 2? It would be
>   nice to not have an explosion of cases.
>
> Do you really mean HGCD2_METHOD 2?  I'd say that only _METHOD 1 is poor
> at q = 1, while _METHOD 2 and 3 handle it well.

Maybe I'm confused. And I think tradeoffs are different for the
double-limb and the single-limb loops.

For the single-limb loop, I think special handling of q == 1 means we
avoid a division and three independent single-limb multiplies, at the
cost of a mispredicted branch, plus some extra unconditional add/compere
operations. I would expect that to be a good tradeoff only on machines
with quite poor multiplication throughput.

For the double-limb loop, special handling of q == 1 also avoids the
operations needed to compute the double-limb remainder, which is a
higher cost and with deeper dependency path.

> Or perhaps you reason along the lines of "if a CPU is good at small
> quotient division, it ought to be good at multiplication, therefore
> _METHOD 1 does not need q = 1 code which principally avoids some limb
> multiplications"?

I think that's a reasonable argument for the single-limb loop.

> Perhaps there is some confusion about which method is chosen for each
> HGCD2_METHOD value?  You designated the original method number 2, your
> plain / based one number 1.  I added lead texts to
> https://gmplib.org/devel/thres/HGCD2_METHOD.html for clarity.

And I was hoping the original method (#2) could be retired. At the
moment, that web page indicates that it's useful only on ia64 aka itanic
(14% margin to method 3). How much do we care about itanic performance?
Is it worth writing div_11 in itanic asm just to eliminate METHOD 2?

>   We call the normalizing div2 only when q > bh, which implies bh^2 < ah.
>   Then bh can't be normalized. When debugging (on 64-bit), I only saw this
>   code entered with size of bh exactly 32 bits, ah exactly 64 bits. But I
>   think it might be called with bh even smaller.
>
> A comment stating that cnt = 0 is not possible (or an ASSERT!).

Makes sense.

> I suggest that we rename HGCD2_METHOD to DIV_11_METHOD.  Agree?

Makes sense. But I think it should still be tuned based on hgcd2
performance. So for speed speed and tuneup we'll get a matrix of
hgcd2_${DIV_11_METHOD}_${DIV_22_METHOD} (for tuneup, DIV_22_METHOD
likely depends on DIV_11_METHOD, but not the other way around, so we
don't have to try all combinations).

> Than we might want DIV_22_METHOD too.  

We'll likely need at least one bitwise, like the current one, and one
based on the selected div1 applied to the high limbs.

>   The thing is, we don't want q = 2 and one number reduced to a zero. We
>   want to terminate with a matrix corresponding to both a and b roughly
>   half a limb (full limb, if it weren't for the truncation when going from
>   double precision to single precision), and |a - b| smaller.
>
> [snip]
>
> Pretty subtle!

That's the "half" in hgcd. Numbers shouldn't be reduced too far.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.