# GCD project status?

Niels Möller nisse at lysator.liu.se
Mon Sep 23 21:18:55 UTC 2019

```nisse at lysator.liu.se (Niels Möller) writes:

> Below is one variant, that seems to pass tests. I haven't done any
> benchmarks yet.

Just gave it a try with tune/speed, together with all the latest div1
variants. This one on my laptop, and not looking that great:

\$ ./tune/speed -c -s1 -p100000 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
overhead 6.02 cycles, precision 100000 units of 8.33e-10 secs, CPU freq 1200.00 MHz
mpn_hgcd2_1   mpn_hgcd2_2   mpn_hgcd2_3   mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
1            #1668.90       1863.72       1670.73       1757.54     1738.50       2044.25

Had a look at the disassembly for the binary algorithm. The
double-precision loop needs, 20 instructions for just the conditional
swap logic, 23 for the clz + shift + subtract, 8 for the shift+add

By doing another conditional subtraction in each iteration, I get down
to 1865 cycles. See below version. In the case acnt == bcnt, the
condition for this is always false. But when acnt < bcnt, we have

2^{k-1} <= a < 2^k
2^{k-2} <= (2^s b) < 2^{k-1}

so floor (a / (b 2^s) can be 1, 2, or 3. So one could consider yet
another conditional subtraction (but then q = 3 2^s is no longer just a

If I also try the SIMD trick, I get down to 1790 cycles, but I'd need a
special middle iteration to make it correct. As long as both a,b >= 2^96
(more than one and a half limb), matrix coefficients fit in a half limb.
But we don't switch to single precision until a,b < 2^96. So we'll have
one or more iterations with a >= 2^96 > b, and there u01 and u11 may
exceeed half a limb. May still be worth it to add that middle loop, for
5% speedup.

Seems hard to beat method 1 on this hardware, but binary may well be a
winner on machines with slow multiply, slow division, but fast

Regards,
/Niels

mp_limb_t swap_xor = ((x) ^ (y)) & mask;	\
(x) ^= swap_xor;				\
(y) ^= swap_xor;				\
} while (0);

int
mpn_hgcd2 (mp_limb_t ah, mp_limb_t al, mp_limb_t bh, mp_limb_t bl,
struct hgcd_matrix1 *M)
{
mp_limb_t u00, u01, u10, u11, swapped;

if (ah < 2 || bh < 2)
return 0;

if (ah > bh || (ah == bh && al > bl))
{
sub_ddmmss (ah, al, ah, al, bh, bl);
if (ah < 2)
return 0;

u00 = u01 = u11 = 1;
u10 = 0;
}
else
{
sub_ddmmss (bh, bl, bh, bl, ah, al);
if (bh < 2)
return 0;

u00 = u10 = u11 = 1;
u01 = 0;
}

swapped = 0;

for (;;)
{
int acnt, bcnt, shift;

if (ah == bh)
goto done;

mask = -(mp_limb_t) (ah < bh);

ASSERT (ah > bh);

if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
{
ah = (ah << (GMP_LIMB_BITS / 2) ) + (al >> (GMP_LIMB_BITS / 2));
bh = (bh << (GMP_LIMB_BITS / 2) ) + (bl >> (GMP_LIMB_BITS / 2));

break;
}

shift = bcnt - acnt;
ASSERT (shift >= 0);
/* Avoid subtraction underflow */
shift -= (shift > 0);

dh = (bh << shift) | ((bl >> 1) >> (GMP_LIMB_BITS - 1 - shift));
dl = bl << shift;
sub_ddmmss (ah, al, ah, al, dh, dl);

mask = -(mp_limb_t) (ah > dh);

if (ah < 2)
{
/* A is too small, so decrement q. */
mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
u01 += q * u00;
u11 += q * u10;
goto done;
}
u01 += (u00 << shift);
u11 += (u10 << shift);
}

/* Single-precision loop */
for (;;)
{
int acnt, bcnt, shift;
if (ah == bh)
break;

mask = -(mp_limb_t) (ah < bh);

ASSERT (ah > bh);

shift = bcnt - acnt;
ASSERT (shift >= 0);
/* Avoid subtraction underflow */
shift -= (shift > 0);

dh = bh << shift;
ah -= dh;

mask = - (mp_limb_t) (ah > dh);

if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
{
/* A is too small, so decrement q. */
mp_limb_t q = ((mp_limb_t) 1 << shift) - 1;
u01 += q * u00;
u11 += q * u10;
break;
}

u01 += (u00 << shift);
u11 += (u10 << shift);
}

done:
ASSERT (u00 * u11 - u01 * u10 == 1);
M->u[0][0] = u00; M->u[0][1] = u01;
M->u[1][0] = u10; M->u[1][1] = u11;

return 1;
}

--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
```