tg at gmplib.org
Tue Sep 3 19:35:45 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
Maybe it makes sense to check this in now, deleting the old div1 and
div2? If it turns out to be a regression (is that easy to see from
nightly tests?), we can consider putting it back under some #if
DIV1_METHOD ..., or replace with some other method, q-by-table-lookup,
or do it bitwise, a <-- a - 2^k b. The current div1 and div2 seem
unlikely to be the right way to do it.
I don't know of any nightly test which would flag that.
As a matter of fact, we don't have any performance regression tests. We
should. It is not easy to do.
I don't think we even measure it, except perhaps indirectly for
determining some _THRESHOLD.
We might want to teach tuneup to choose between small quotient division
primitives., like we do for Jacobi.
./tune/speed -c -p 100000 -s3-5 mpn_gcd
overhead 5.92 cycles, precision 100000 units of 2.86e-10 secs, CPU freq
So for n = 3, cycles per input bit is reduced from 22.5 to 19.5 (on the
broadwell machine, around 16). How does that compare to mpn_gcd_33, I'd
imagine it's considerably faster?
C AMD bd4 8.6
C AMD zn1 7.3
C AMD zn2 6.6
C Intel HWL 10.3
C Intel BWL 8.0
C Intel SKL 7.6
I only wrote one variant, and that variant uses shlx and shrx, which
only the CPUs above support. The system 'shell' does not run it (but
Ivy bridge, which 'shell' uses, is slightly slower than hwl (haswell).
We can estinate that gcd_33 is around 1.75x faster than the new and
improved mpn_gcd. I believe using / on these processors is not the
right approach and that some table based division or my previously
suggested division (which handles q <= 7 with some compares and masking)
could shave off another 25%...
Please encrypt, key id 0xC8601622
More information about the gmp-devel