mpz_prevprime

Tue Mar 24 00:10:00 UTC 2020

I didn't see your proposed patch on the March 21st email, we use different
tricks which might be possible to combine.

You suggestion to avoiding miller_rabin by using trial division is where
the core of the speedup for both of our code is coming from.
We both use int instead of mpz_t to avoid the overhead of mpz_cdiv_ui which
I believe is an important speedup but not the most important.
(I'm don't think your ulong patch is needed but I can measure at a later
point)

My code doesn't use a sieve (and instead checks each number for divisors
similar to the old code) because the average gap for small numbers is small
300000 / primepi(300000) = 11
Because the measured threshold for this was much larger (~80 million), it
might actually make sense to use a sieve after some threshold
e.g. checking ~1000 primes 11 times is more expensive than marking off
multiple numbers for the smaller primes.

If you think that's a 1.5-3x speedup for inputs 16-30bits is important I
can try to dynamically generate a longer list of primes
My hesitation with dynamically generation primes is that mpz_nextprime only
takes ~ 4e-6*  seconds for 30 bit inputs on my computer
and it will take non-trivial work to sieve all the primes up to 1X,000.

* tune/speed's output which I believe has the units seconds and can be read
directly but that's an assumption.

My suggestion is:
* Commit my code; it's simple and get's a huge speedup (over both the
existing and your code) for <16 bit inputs
unfortunately this can't be tuned above the static list size ** 2

If you want to try with a dynamical list (e.g. it's important to get a
1.5-3x speedup on 16-30 bit inputs)
* Commit my code with a new tunable THRESHOLD_TRAIL_DIV (I'd guess
X,000,000) and possibly extending the list to 300 primes
* I'll modify your code to dynamically generate primes up to X and we'll
tune a 2nd THRESHOLD_TRAIL_DIV_SIEVE (I'd guess ~150,000,000)

On Mon, Mar 23, 2020 at 2:38 PM Marco Bodrato <bodrato at mail.dm.unipi.it>
wrote:

> Ciao,
>
> Il 2020-03-22 10:00 Seth Troisi ha scritto:
> > I measure the regression as ~5-15% for input < ~10,000

> And no regression for larger input? Good.
>

> > I have a proposed patch which uses int, and trial div up to some
> > threshold this appears to be 5x faster.
>
> A couple of question arises.
> 5x faster than what? Than the patch I proposed or than the old/current
> code?
>
Faster than the old code and your code
==> time_nextprime_pre.data <==
10       2.153700000e-06
11       2.143721467e-06
12       2.132065283e-06
13       2.143785751e-06
14       2.132306615e-06
15       2.161254031e-06
16       2.234000000e-06
17       2.291227474e-06

==> time_nextprime_seth.data <==
10       5.253385820e-07
11       4.980276023e-07
12       4.826503269e-07
13       4.860404419e-07
14       4.833486686e-07
15       5.412016226e-07
16       6.028669974e-07
17       6.123738235e-07

==> time_nextprime_marco.data <==
10       1.626300000e-06
11       1.580771135e-06
12       1.623004920e-06
13       1.572217889e-06
14       1.647112356e-06
15       1.636840921e-06
16       1.742080000e-06
17       1.966933016e-06

>
> Which one is the good idea? using int or avoiding the sieve?
>
See above

> Using unsigned long to compute the modulus is obviously faster than many
> calls to mpz_cdiv_ui. I tried a patch that surely is not as fast as
> yours, but should speed-up all numbers that fits in a single unsigned
> long. The speed-up seems small, I'm not sure it's worth.
>
> > The cut off point as I measure it is ~80,000,000 but requires
> > increasing the primegap_small list to ~1100 entries
> >
> > I'm not sure what you think about a list of that length.
>
> Well, the cut off point is surely different on different architectures.
> It should be tuned.
> Moreover: a large list ... can't it be generated on the fly? If we want
> a larger table, in my opinion it should be an improvement for all the
> functions that need primes (_primorial, _bin, _fac...).

I'm not sure I'm up to that task right now. see above for my suggestions

>

> > It might also be useful to commit tune_nextprime_small which
> > dynamically selects a higher reps count for
> > mpz_nextprime input and helps increase precision.
>
> Uhm...
> j = 200000 / s->size;
>
> This means that
> for size=10 you compute 20000 primes: all primes between 2^10 and 2^18.
> for size=11: all primes between 2^11 and 2^18.
> for size=12... and so on up to 2^18.
>
> This is not exactly representative of the speed of nextprime on operands
> around 2^10, 2^11... and so on.
>
> This is a great point, I modified the code so It only changes the outer
loop count.
I included a screenshot showing how much this reduces variance over
multiple runs.

>
> PS: a comment in the current code says
>   /* Larger threshold is faster but takes ~O(n/ln(n)) memory.
> To be honest, the array sieve needs n/24 bytes, and the array
> primegap_tmp uses primepi(n) ~= n/ln(n) bytes. And asymptotically the
> sum of both is O(n).
>
You are correct, let's change the comment by which I mean you :)

> Ĝis,
> m
>
Again thanks for the great comments.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tune_nextprime_small.patch
Type: text/x-patch
Size: 1542 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200323/cbfaa440/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2020-03-23 17-08-14.png
Type: image/png
Size: 51464 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200323/cbfaa440/attachment-0001.png>