hgcd1/2

Mon Sep 9 20:39:57 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> I believe METHOD 1 could almost always be pointless if METHOD 3's
> straight line code is made really efficient.

I'd guess so too. But benchmarks are better than faith alone ;-)

> That should in particular be true if the q = 1 code is removed from
> hgcd2, for machines with good multiplication throughput.

Can we remove that for everything but HGCD2_METHOD == 2? It would be
nice to not have an explosion of cases.

> The normalisation code only works if operands are not already
> normalised.  Cannot they be that?

We call the normalizing div2 only when q > bh, which implies bh^2 < ah.
Then bh can't be normalized. When debugging (on 64-bit), I only saw this
code entered with size of bh exactly 32 bits, ah exactly 64 bits. But I
think it might be called with bh even smaller.

> Should be revive the commented-out div2?  It looks good to me, the
> comment about that it is slower than the branchy div2 is probably from
> the era where branchy div1/div2 was great.

I haven't had the time to benchmark it, but I wouldn't be surprised if
it runs better than the other div2.

> (I suspect that to be when Vax and m68020 was hot.)

IIRC, the alternative div2 was written 2008, and benchmarked on AMD
Opteron (or at most a few years older). One or both of div1 and the
"regular" div2 were copied from much older code.

> I think there might be a slight bug or inefficiency here (from hgcd2.c):
>
>       ASSERT (ah >= bh);
>
>       ah -= bh;
>       if (ah < (CNST_LIMB (1) << (GMP_LIMB_BITS / 2 + 1)))
> 	break;
>
>       if (ah <= bh)
> 	{
> 	  /* Use q = 1 */
> 	  u01 += u00;
> 	  u11 += u10;
> 	}
>       else
> 	{
> 	  mp_double_limb_t rq = div1 (ah, bh);
>
> If, in the the last 'if' above ah = bh, we have as a matter of fact q =
> 2 (with zero remainder).

The thing is, we don't want q = 2 and one number reduced to a zero. We
want to terminate with a matrix corresponding to both a and b roughly
half a limb (full limb, if it weren't for the truncation when going from
double precision to single precision), and |a - b| smaller.

If we made the change from "<=" to "<", the ah == bh case would continue to

            mp_double_limb_t rq = div1 (ah, bh);
q  <-- 1    mp_limb_t q = rq.d1;
ah <-- 0    ah = rq.d0;

cond true   if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
              {
                /* A is too small, but q is correct. */
                u01 += q * u00;
                u11 += q * u10;
                break;
              }

so we'll do the same matrix update, u01 += u00; u11 += u10; and exit the
loop.

If we keep "<=" as is, we instead update u01, u11 and continue to the
next iteration:

bh <-- 0    bh -= ah;
cond true   if (bh < (CNST_LIMB (1) << (GMP_LIMB_BITS / 2 + 1)))
              break;

and exit the loop. So we'll exit the loop correctly either way, but
using "<=" avoids the div1 call and a few multiplications just before
exit.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.