General mpn_gcd_basecase
Torbjörn Granlund
tg at gmplib.org
Fri Aug 30 11:56:14 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
I doubt sign multiply would be that useful. An instruction to do signed
64b x unsigned 64b --> signed 128b, would be nice, but I don't know any
arch having that.
Unsigned ans signed widening multiplication can be done in terms of ther
other. There is code in longlong.h for doing that:
/* If we lack umul_ppmm but have smul_ppmm, define umul_ppmm in terms of
smul_ppmm. */
#if !defined (umul_ppmm) && defined (smul_ppmm)
#define umul_ppmm(w1, w0, u, v) \
do { \
UWtype __w1; \
UWtype __xm0 = (u), __xm1 = (v); \
smul_ppmm (__w1, w0, __xm0, __xm1); \
(w1) = __w1 + (-(__xm0 >> (W_TYPE_SIZE - 1)) & __xm1) \
+ (-(__xm1 >> (W_TYPE_SIZE - 1)) & __xm0); \
} while (0)
#endif
I think signed x unsigned is about half that hard. :-)
Just suggesting that we'd may get slightly less latency with
a = up[0] - m*vp[0];
count_trailing_zeros (cnt, a);
cy = mpn_addmul_1 (up, vp, n, m);
Maybe not so useful on it's own, but with assembly, there are further
opportunities.
Like doing rshift on-the-fly.
> This addmul_1-only, positive-only table is a strange gcd animal. It
> will never yield U = 0. Individual limbs are not unlikely to become
> zero.
Ooops. Then one definitely need a different stop condition than comparing
something to zero.
I tried the stop condition U = 0 first, and it seemed to run really
slowly. :-)
If your willing to have both count_leading_zeros and
count_trailing_zeros in the loop, maybe it's simpler and more efficient
with a table-based binary euclid?
CND_SWAP
count_leading_zeros(cnt, up[n-1] | vp[n-1])
q = tab[...high bits...]
if (q odd)
u <-- u - q v
else
u <-- (q+1) u - v
count_trailing_zeros(cnt, up[0]); /* At least one */
u >>= cnt;
I don't follow the u assignment in the else clause. And even*odd-odd
would seem to become odd.
Perhaps you meant u <-- u - (v - q)v?
But a table-based Euclid would be interesting!
On many machines, count_*ing_zeros are cheap. On AMD Ryzen, _leading_
has the same latency and throughput as plain addition (latency = 1,
throughput 4/cycle), while _trailing_ has twice the latency and half the
throughput. On Intel the latency is worse, 3 cycles for both trailing
and leading with a throughput of 1/cycle, but that's still not bad.
One might unroll Euclid 2x and only recompute _leading_ for the affected
operand.
Would you give your table-based Euclid a try? Please use
~tege/gcd/test-gcd_33.c for testing, replacing the _33 call with an _n
call. (I'll might make a soecial test_gcd_n.c at some point.)
It seems like an mpn_rsbmul_1 would be a useful primitive for several
variants.
Yep. I never even considered that in GMP until now.
Sounds like the opposite of binary euclid.
Perhaps!
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list