mpz_prevprime
Seth Troisi
braintwo at gmail.com
Tue Mar 24 00:10:00 UTC 2020
I didn't see your proposed patch on the March 21st email, we use different
tricks which might be possible to combine.
You suggestion to avoiding miller_rabin by using trial division is where
the core of the speedup for both of our code is coming from.
We both use int instead of mpz_t to avoid the overhead of mpz_cdiv_ui which
I believe is an important speedup but not the most important.
(I'm don't think your ulong patch is needed but I can measure at a later
point)
My code doesn't use a sieve (and instead checks each number for divisors
similar to the old code) because the average gap for small numbers is small
300000 / primepi(300000) = 11
Because the measured threshold for this was much larger (~80 million), it
might actually make sense to use a sieve after some threshold
e.g. checking ~1000 primes 11 times is more expensive than marking off
multiple numbers for the smaller primes.
If you think that's a 1.5-3x speedup for inputs 16-30bits is important I
can try to dynamically generate a longer list of primes
My hesitation with dynamically generation primes is that mpz_nextprime only
takes ~ 4e-6* seconds for 30 bit inputs on my computer
and it will take non-trivial work to sieve all the primes up to 1X,000.
* tune/speed's output which I believe has the units seconds and can be read
directly but that's an assumption.
My suggestion is:
* Commit my code; it's simple and get's a huge speedup (over both the
existing and your code) for <16 bit inputs
unfortunately this can't be tuned above the static list size ** 2
If you want to try with a dynamical list (e.g. it's important to get a
1.5-3x speedup on 16-30 bit inputs)
* Commit my code with a new tunable THRESHOLD_TRAIL_DIV (I'd guess
X,000,000) and possibly extending the list to 300 primes
* I'll modify your code to dynamically generate primes up to X and we'll
tune a 2nd THRESHOLD_TRAIL_DIV_SIEVE (I'd guess ~150,000,000)
On Mon, Mar 23, 2020 at 2:38 PM Marco Bodrato <bodrato at mail.dm.unipi.it>
wrote:
> Ciao,
>
> Il 2020-03-22 10:00 Seth Troisi ha scritto:
> > I measure the regression as ~5-15% for input < ~10,000
> And no regression for larger input? Good.
>
> > I have a proposed patch which uses int, and trial div up to some
> > threshold this appears to be 5x faster.
>
> A couple of question arises.
> 5x faster than what? Than the patch I proposed or than the old/current
> code?
>
Faster than the old code and your code
==> time_nextprime_pre.data <==
10 2.153700000e-06
11 2.143721467e-06
12 2.132065283e-06
13 2.143785751e-06
14 2.132306615e-06
15 2.161254031e-06
16 2.234000000e-06
17 2.291227474e-06
==> time_nextprime_seth.data <==
10 5.253385820e-07
11 4.980276023e-07
12 4.826503269e-07
13 4.860404419e-07
14 4.833486686e-07
15 5.412016226e-07
16 6.028669974e-07
17 6.123738235e-07
==> time_nextprime_marco.data <==
10 1.626300000e-06
11 1.580771135e-06
12 1.623004920e-06
13 1.572217889e-06
14 1.647112356e-06
15 1.636840921e-06
16 1.742080000e-06
17 1.966933016e-06
>
> Which one is the good idea? using int or avoiding the sieve?
>
See above
> Using unsigned long to compute the modulus is obviously faster than many
> calls to mpz_cdiv_ui. I tried a patch that surely is not as fast as
> yours, but should speed-up all numbers that fits in a single unsigned
> long. The speed-up seems small, I'm not sure it's worth.
>
> > The cut off point as I measure it is ~80,000,000 but requires
> > increasing the primegap_small list to ~1100 entries
> >
> > I'm not sure what you think about a list of that length.
>
> Well, the cut off point is surely different on different architectures.
> It should be tuned.
> Moreover: a large list ... can't it be generated on the fly? If we want
> a larger table, in my opinion it should be an improvement for all the
> functions that need primes (_primorial, _bin, _fac...).
I'm not sure I'm up to that task right now. see above for my suggestions
>
> > It might also be useful to commit tune_nextprime_small which
> > dynamically selects a higher reps count for
> > mpz_nextprime input and helps increase precision.
>
> Uhm...
> j = 200000 / s->size;
>
> This means that
> for size=10 you compute 20000 primes: all primes between 2^10 and 2^18.
> for size=11: all primes between 2^11 and 2^18.
> for size=12... and so on up to 2^18.
>
> This is not exactly representative of the speed of nextprime on operands
> around 2^10, 2^11... and so on.
>
> This is a great point, I modified the code so It only changes the outer
loop count.
I included a screenshot showing how much this reduces variance over
multiple runs.
>
> PS: a comment in the current code says
> /* Larger threshold is faster but takes ~O(n/ln(n)) memory.
> To be honest, the array sieve needs n/24 bytes, and the array
> primegap_tmp uses primepi(n) ~= n/ln(n) bytes. And asymptotically the
> sum of both is O(n).
>
You are correct, let's change the comment by which I mean you :)
> Ĝis,
> m
>
Again thanks for the great comments.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tune_nextprime_small.patch
Type: text/x-patch
Size: 1542 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200323/cbfaa440/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2020-03-23 17-08-14.png
Type: image/png
Size: 51464 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200323/cbfaa440/attachment-0001.png>
More information about the gmp-devel
mailing list