hgcd2_jacobi

Niels Möller nisse at lysator.liu.se
Wed Jan 22 14:00:22 UTC 2020


Hi,

with the recent gcd improvements, we optimized the mpn_hgcd2 function
and tuned the way we compute small quotients. The function
mpn_hgcd2_jacobi does almost the same thing, but with updates of the
jacobi state (a few bits) for each quotient. I've tried simply copying
all the new div1 and div2 variants to hgcd2_jacobi.c, reusing the same
HGCD2_DIV?_METHOD tuning variables, and adjusting it accordingly.

Before:

$ ./tune/speed -s 1-10 -r mpn_gcd mpz_jacobi
overhead 0.000000005 secs, precision 10000 units of 8.33e-10 secs, CPU
freq 1200.18 MHz
              mpn_gcd    mpz_jacobi
1        #0.000000304        1.6741
2        #0.000000754        1.8380
3        #0.000003301        1.3186
4        #0.000004953        1.2960
5        #0.000006564        1.2935
6        #0.000008255        1.2745
7        #0.000009877        1.2769
8        #0.000011591        1.2563
9        #0.000013175        1.2726
10       #0.000014881        1.2649

After:

$ ./tune/speed -s 1-10 -r mpn_gcd mpz_jacobi
overhead 0.000000005 secs, precision 10000 units of 8.33e-10 secs, CPU
freq 1200.51 MHz
              mpn_gcd    mpz_jacobi
1        #0.000000304        1.6773
2        #0.000000754        1.8374
3        #0.000003293        1.1143
4        #0.000004959        1.0864
5        #0.000006567        1.0713
6        #0.000008269        1.0718
7        #0.000009870        1.0612
8        #0.000011570        1.0558
9        #0.000013191        1.0606
10       #0.000014864        1.0534

So there's clearly room for improvement, but I don't quite like to
duplicate all of the somewhat complex div1/div2 code (~450 lines).
Options:

1. Move div1/div2 to gmp-impl.h. We may need to move tables used for
   div1 method 4 and method 5 to a new source file.

2. Let hgcd_jacobi.c include hgcd2.c, with preprocessor hacks, e.g,
   sprinkling some

     IF_JACOBI (bits = mpn_jacobi_update (bits, 1, 1));

   throughout the hgcd2 function. That eliminates more of the source
   code duplication than just div1/div2, which has its own pros and
   cons.

3. Just duplicate it all.

I also wonder if we should keep the method 4 div1? It's currently
selected only in a single configuration, x86/bd1/gmp-mparam.h, but
in the latest measurement on
https://gmplib.org/devel/thres/HGCD2_DIV1_METHOD.html it's beaten by
method 3. There are half a dozen gmp-mparam.h that claim it's second
best, from

$ find -name gmp-mparam.h -exec grep -H HGCD2_DIV1_METHOD '{}' ';' |grep 'than 4'

I get

x86_64/k10:#define HGCD2_DIV1_METHOD    5  /* 8.38% faster than 4 */
x86_64/bd1:#define HGCD2_DIV1_METHOD    3  /* 2.31% faster than 4 */
x86_64/atom:#define HGCD2_DIV1_METHOD   3  /* 5.86% faster than 4 */
arm/v7a/cora8:#define HGCD2_DIV1_METHOD 5  /* 4.29% faster than 4 */
x86/k10:#define HGCD2_DIV1_METHOD       3  /* 0.70% faster than 4 */
x86/coreisbr:#define HGCD2_DIV1_METHOD  1  /* 5.54% faster than 4 */
x86/k7:#define HGCD2_DIV1_METHOD        3  /* 3.84% faster than 4 */

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.



More information about the gmp-devel mailing list