hgcd1/2

Tue Sep 3 19:35:45 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  Maybe it makes sense to check this in now, deleting the old div1 and
  div2? If it turns out to be a regression (is that easy to see from
  nightly tests?), we can consider putting it back under some #if
  DIV1_METHOD ..., or replace with some other method, q-by-table-lookup,
  or do it bitwise, a <-- a - 2^k b. The current div1 and div2 seem
  unlikely to be the right way to do it.

I don't know of any nightly test which would flag that.

As a matter of fact, we don't have any performance regression tests.  We
should.  It is not easy to do.

I don't think we even measure it, except perhaps indirectly for
determining some _THRESHOLD.

We might want to teach tuneup to choose between small quotient division
primitives., like we do for Jacobi.

  On shell:

  Before:

   ./tune/speed -c -p 100000 -s3-5 mpn_gcd
  overhead 5.92 cycles, precision 100000 units of 2.86e-10 secs, CPU freq
  3500.09 MHz
                mpn_gcd
  3             4328.73
  4             6317.43
  5             8320.82

  After:

                mpn_gcd
  3             3768.16
  4             5509.65
  5             7182.58

  So for n = 3, cycles per input bit is reduced from 22.5 to 19.5 (on the
  broadwell machine, around 16). How does that compare to mpn_gcd_33, I'd
  imagine it's considerably faster?

C AMD bd4	 8.6
C AMD zn1	 7.3
C AMD zn2	 6.6
C Intel HWL	 10.3
C Intel BWL	 8.0
C Intel SKL	 7.6

I only wrote one variant, and that variant uses shlx and shrx, which
only the CPUs above support.  The system 'shell' does not run it (but
ashell does!).

Ivy bridge, which 'shell' uses, is slightly slower than hwl (haswell).
We can estinate that gcd_33 is around 1.75x faster than the new and
improved mpn_gcd.  I believe using / on these processors is not the
right approach and that some table based division or my previously
suggested division (which handles q <= 7 with some compares and masking)
could shave off another 25%...

-- 
Torbjörn
Please encrypt, key id 0xC8601622