hgcd1/2

Tue Sep 3 15:59:10 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   Here's a div2 function that does that. Together with the deleted div1,
>   it gives a 25% (!) speed up of gcd in the range 3-10 limbs, benchmarked
>   on my broadwell machine. And should do well on all machines with decent
>   small-quotient division.
>
> Cool!
>
> And broadwell (while not on my list) presumably have the same slowish
> division as haswell and skylake, as it came between those chips.
>
> Amd and ARM processors should get an even greater boost.

Maybe it makes sense to check this in now, deleting the old div1 and
div2? If it turns out to be a regression (is that easy to see from
nightly tests?), we can consider putting it back under some #if
DIV1_METHOD ..., or replace with some other method, q-by-table-lookup,
or do it bitwise, a <-- a - 2^k b. The current div1 and div2 seem
unlikely to be the right way to do it.

> I am curious how these GCD implementations compare:
>
> 1. The gcd_33 attached for its only use case n = 3.  It runs on Haswell
>    and newer and on Ryzen (but also on bd4 aka Excavator)

On shell:

Before:

 ./tune/speed -c -p 100000 -s3-5 mpn_gcd
overhead 5.92 cycles, precision 100000 units of 2.86e-10 secs, CPU freq
3500.09 MHz
              mpn_gcd
3             4328.73
4             6317.43
5             8320.82

After:

              mpn_gcd
3             3768.16
4             5509.65
5             7182.58

So for n = 3, cycles per input bit is reduced from 22.5 to 19.5 (on the
broadwell machine, around 16). How does that compare to mpn_gcd_33, I'd
imagine it's considerably faster?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.