Small operands gcd improvements
Torbjörn Granlund
tg at gmplib.org
Tue Aug 6 19:08:23 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
Right, mpn_gcd usually ends with a call to mpn_gcd_1 or gcd_2, and the
latter also usually ends with a call to gcd_1. But I think it's easiest
to leave that as is until we have good gcd_22.
OK, but let's not forget about it!
> I expect asm gcd_1 to disappear as the C code should be equivalent. Do
> you agree?
That would be nice. There will be one more function call, but hopefully
that's not going to be a significant performance regression.
Perhaps it would be worthwhile to do a tailcall for when the leftshift
is 0? I think most gcd calls, irrespective of operand size, will not
have any factor of two, let alone a common one!
> I suppose some hardwired stuff for the case u >> v (not bitshift, the
> mathematical meaning of >>!) might want to be parameterised and also
> ideally tune/tuneup'ed.
Should that be done by gcd_1 only? Or do we need some variant of gcd_11
with an initial division?
I suppose we should have a lowest level without that (conditional or
unconditional) and that that level is gcd_11.
My workstation (intel broadwell) uses takes 96 cycles for a division (if
I read https://gmplib.org/~tege/x86-timing.pdf, or is there are faster
64/64 div isntruction?). And gcd_11 runs at roughly 4 cycles per input
bit according to speed. So then threshold should be around 24.
Modern Intel processors have different code for 128b/64b and 64b/64b
division. The former is very slow while the latter apparently uses a
high-radix SRT.
AMD Ryzen has fast 128b/64b (SRT-4 I think)..
Below patch to add a gcd_11 entrypoint for this arch. Passes make check,
but would be good to also test with devel/try.
Ah, so you suggest that we keep asm gcd_1 after all. I was in the
editing process of removing them! :-)
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list