mpz_prevprime

Seth Troisi braintwo at gmail.com
Mon Mar 9 01:56:33 UTC 2020


>From Marco Bodrato's analysis I got inspired and modified nextprime to uses
a much larger segmented sieve and tuned the prime limit for large inputs.
You can see some of my thresholds and testing input in this google
sheet[1]. The end result is sieve up ~ LOG2(N) ^ (20/7) / 3268
For 1000 bit numbers this sieves up to 114,000 (in 1e-5 seconds) and using
the resulting 11k primes for trial division.
For 10,000 bit numbers this sieves up to 80 million (which takes ~4 seconds
but reduces nextprime from ~300seconds to ~200 seconds)

Compared with the segmented sieve version this ends up being 1.3x-2.5x
faster for numbers > 500 bits.
I have detailed numbers justifying my calculations in the google sheet[1]
and colab python optimizer script[2]

My code I used to benchmark attached in the first patch: tune.patch
It prints some timing info at a number of different input sizes
$ hg import ~/Downloads/tune.patch --no-commit
$ cd ~/Projects/gmp/build/ && make -j8 >> /dev/null && cd tests/mpz && make
clean t-nextprime-tune >> /dev/null && time ./t-nextprime-tune
32, count: 14062, gap: 302639, 0.08919 seconds = avg 0.00001
64, count: 7031, gap: 302559, 0.10212 seconds = avg 0.00001
128, count: 3515, gap: 312423, 0.20790 seconds = avg 0.00006
256, count: 1757, gap: 304433, 0.57408 seconds = avg 0.00033
1000, count: 450, gap: 316709, 11.79883 seconds = avg 0.02622
2000, count: 225, gap: 313179, 80.45503 seconds = avg 0.35758

My improvements to next_prime (with debugging print statements) are
included in the 2nd patch nextprime_sieve_debug.patch
$ hg import ~/Downloads/nextprime_sieve_debug.patch --force --no-commit
32 bits, 16 primes  0.0000 , mod 0.0000  | => 2 tests
64 bits, 32 primes  0.0000 , mod 0.0000  | => 4 tests
128 bits, 64 primes  0.0000 , mod 0.0000  | => 1 tests
Pi(2324) = 342 | 256 bits, 342 primes  0.0000 , mod 0.0000  | => 8 tests
256, count: 1757, gap: 304433, 0.49184 seconds = avg 0.00028
Pi(114063) = 10792 | 1000 bits, 10792 primes  0.0002 , mod 0.0007  | => 59
tests
1000, count: 450, gap: 316709, 7.44536 seconds = avg 0.01655
Pi(826479) = 65910 | 2000 bits, 65910 primes  0.0012 , mod 0.0051  | => 18
tests
2000, count: 225, gap: 313179, 44.00031 seconds = avg 0.19556

I also tested with speed (thanks for helping me get that committed last
year)
$ hg revert mpz/nextprime.c
$ hg import ~/Downloads/nextprime_sieve_clean.patch --force --no-commit
$ cd build; make clean -j4; make -C tune/ speed -j4
$ ./tune/speed -s 300,400,500,750,1000 mpz_nextprime -P time_nextprime2
(run before change and save in time_nextprime)
$ pr -m -t time_nextprime.data time_nextprime2.data | awk '{ print
$0,"\t"$4/$2 }'
300 4.754023437e-04    300     4.111347656e-04 0.864814
400 1.221230469e-03    400     1.058787109e-03 0.866984
500 2.007132813e-03    500     1.590521484e-03 0.792435
750 8.607291016e-03    750     5.961019531e-03 0.692555
1000 2.284160547e-02    1000     1.475637695e-02 0.646031

I cleaned up the code without any of the debug printing in the 3rd patch
nextprime_sieve_clean.patch which I'd love people to consider for submission

[1]
https://docs.google.com/spreadsheets/d/1sVm0ZGxFIASj5drxznxjrtTNMGgNXbZ4BZukReQlPVM/edit?usp=sharing
[2]
https://colab.research.google.com/drive/1UZyBbiRfuhFwQxM_N_ahXH0nNy9b-hXP


On Fri, Mar 6, 2020 at 4:03 PM Seth Troisi <braintwo at gmail.com> wrote:

> Some more justification for my patch
>
> I wrote some code that replaces mpz_millerrabin with a static lookup and
> ran it for a number of values.
>
> Existing Code (with real mpz_millerrabin)
> 100, gap: 1533, 0.00164 seconds
> 200, gap: 2547, 0.00675 seconds
> 300, gap: 3757, 0.01161 seconds
> 500, gap: 3765, 0.02568 seconds
> 1000, gap: 20835, 0.67636 seconds
> 2000, gap: 23685, 6.03032 seconds
> 5000, gap: 58867, 167.65375 seconds
>
> With is_prime_lookup (e.g. nearly instant mpz_millerrabin)
> 100, gap: 1533, 0.00018 seconds
> 200, gap: 2547, 0.00045 seconds
> 300, gap: 3757, 0.00087 seconds
> 500, gap: 3765, 0.00085 seconds
> 1000, gap: 27393, 0.00580 seconds
> 2000, gap: 27711, 0.00557 seconds
> 5000, gap: 74443, 0.01320 seconds
>
> Code in nextprime_segment.patch (see previous message) with is_prime_lookup
> 100, gap: 1533, 0.00011 seconds
> 200, gap: 2547, 0.00015 seconds
> 300, gap: 3757, 0.00018 seconds
> 500, gap: 3765, 0.00023 seconds
> 1000, gap: 27393, 0.00061 seconds
> 2000, gap: 27711, 0.00070 seconds
> 5000, gap: 74443, 0.00198 seconds
>
> So the overhead from the existing code is less than
> 6% at 200 bits
> 3% at 500 bits
> 1% at 1000 bits.
>
> The new code improves this to
> 2% at 200 bits
> 1% at 500 bits
> 0.1% at 1000 bits
>
> I'm going to use this program to tune a new prime_limit on the range
> 300-5000 bits which is harder to tune with tune/speed.
>
>
> On Fri, Mar 6, 2020 at 1:55 PM Seth Troisi <braintwo at gmail.com> wrote:
>
>> I tried using a segmented sieve. this avoids multiplication or mod in the
>> inner loop.
>> It's an improvement and the code is easy to comprehend.
>>
>> $ ./tune/speed -s 100-300,500,1000 -f 1.1 mpz_nextprime -p mpz_nextprime
>> $ hg import ~/Downloads/nextprime_segment.patch --no-commit
>> $ make -j4; make -C tune/ speed -j4
>> $ ./tune/speed -s 100-300,500,1000 -f 1.1 mpz_nextprime -P mpz_nextprime2
>>
>> $ pr -m -t mpz_nextprime.data mpz_nextprime2.data | awk '{ print
>> $0,"\t"$4/$2 }'
>> 100 3.658593750e-05    100     3.326367187e-05 0.909193
>> 110 3.904296875e-05    110     3.598828125e-05 0.921761
>> 121 4.675585937e-05    121     4.157031250e-05 0.889093
>> 133 8.379882812e-05    133     7.977343750e-05 0.951964
>> 146 9.650195312e-05    146     8.803710937e-05 0.912283
>> 160 1.245546875e-04    160     1.032539062e-04 0.828985
>> 176 1.269101562e-04    176     1.173007813e-04 0.924282
>> 193 1.977187500e-04    193     1.846230469e-04 0.933766
>> 212 2.087812500e-04    212     1.933378906e-04 0.926031
>> 233 2.356660156e-04    233     2.218164063e-04 0.941232
>> 256 3.030214844e-04    256     2.868691406e-04 0.946696
>> 281 4.203593750e-04    281     3.908398437e-04 0.929775
>> 500 2.132431641e-03    500     2.013937500e-03 0.944432
>> 1000 2.307201758e-02    1000     2.284832031e-02 0.990304
>>
>> This can undoubtedly benefit from a larger sieve also,
>> which I'm testing in a different patch.
>>
>> On Thu, Mar 5, 2020 at 6:05 PM Seth Troisi <braintwo at gmail.com> wrote:
>>
>>> I tried using a segmented sieve. this avoids multiplication or mod in
>>> the inner loop.
>>> It's an improvement and the code is easy to comprehend.
>>>
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 10:25 AM Niels Möller <nisse at lysator.liu.se>
>>> wrote:
>>>
>>>> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
>>>>
>>>> > Ciao,
>>>> >
>>>> > Il Mer, 12 Febbraio 2020 7:26 am, Niels Möller ha scritto:
>>>> >> And for a small prime gap (common case), this cost should be the main
>>>> >> part of the sieving. So it would make sense to optimize it. If we
>>>> also
>>>> >> use the ptab, we could compute modulo single-limb prime products
>>>> using
>>>> >> precomputed constants.
>>>> >
>>>> > Yes, of course.
>>>>
>>>> Another version below, using ptab. Timing:
>>>>
>>>> $ ./tune/speed -s 100-300,500,1000 -f 1.1 mpz_nextprime
>>>>         mpz_nextprime
>>>> 100       0.000116762
>>>> 110       0.000126770
>>>> 121       0.000147203
>>>> 133       0.000257994
>>>> 146       0.000291051
>>>> 160       0.000343168
>>>> 176       0.000386605
>>>> 193       0.000625994
>>>> 212       0.000670035
>>>> 233       0.000763240
>>>> 256       0.000942154
>>>> 281       0.001418639
>>>> 500       0.006632588
>>>> 1000      0.073935992
>>>>
>>>> So also close to 15% speedup for 100 bits, and the same as previous
>>>> version, almost 25%, for 1000 bits.
>>>>
>>>> I used the very crude tuning
>>>>
>>>>   limit = (nbits < 200) ? nbits / 2 : PTAB_SIZE;
>>>>
>>>> to avoid regressing for smaller sizes.
>>>>
>>>> > And the next question will be: should we generate on the fly an even
>>>> > larger table of primes when the required size grows beyond the
>>>> > precomputed table?
>>>>
>>>> Don't know. I don't quite understand how large a table could be
>>>> beneficial, but I guess it might make sense to at least limit it to size
>>>> of L1 cache.
>>>>
>>>> Regards,
>>>> /Niels
>>>>
>>>> /* mpz_nextprime(p,t) - compute the next prime > t and store that in p.
>>>>
>>>> Copyright 1999-2001, 2008, 2009, 2012 Free Software Foundation, Inc.
>>>>
>>>> Contributed to the GNU project by Niels Möller and Torbjorn Granlund.
>>>>
>>>> This file is part of the GNU MP Library.
>>>>
>>>> The GNU MP Library is free software; you can redistribute it and/or
>>>> modify
>>>> it under the terms of either:
>>>>
>>>>   * the GNU Lesser General Public License as published by the Free
>>>>     Software Foundation; either version 3 of the License, or (at your
>>>>     option) any later version.
>>>>
>>>> or
>>>>
>>>>   * the GNU General Public License as published by the Free Software
>>>>     Foundation; either version 2 of the License, or (at your option) any
>>>>     later version.
>>>>
>>>> or both in parallel, as here.
>>>>
>>>> The GNU MP Library is distributed in the hope that it will be useful,
>>>> but
>>>> WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> MERCHANTABILITY
>>>> or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>>>> for more details.
>>>>
>>>> You should have received copies of the GNU General Public License and
>>>> the
>>>> GNU Lesser General Public License along with the GNU MP Library.  If
>>>> not,
>>>> see https://www.gnu.org/licenses/.  */
>>>>
>>>> #include "gmp-impl.h"
>>>> #include "longlong.h"
>>>>
>>>> #if 1
>>>> struct gmp_primes_dtab {
>>>>   mp_limb_t p;
>>>>   mp_limb_t binv;
>>>>   mp_limb_t lim;
>>>> };
>>>>
>>>> struct gmp_primes_ptab {
>>>>   mp_limb_t ppp;        /* primes, multiplied together */
>>>>   mp_limb_t cps[7];     /* ppp values pre-computed for mpn_mod_1s_4p */
>>>>   gmp_uint_least32_t idx:24;    /* index of  first primes in dtab */
>>>>   gmp_uint_least32_t np :8;     /* number of primes related to this
>>>> entry */
>>>> };
>>>>
>>>> static const struct gmp_primes_dtab dtab[] =
>>>> {
>>>> #define WANT_dtab
>>>> #define P(p,inv,lim) {p, inv,lim}
>>>> #include "trialdivtab.h"
>>>> #undef WANT_dtab
>>>> #undef P
>>>> };
>>>>
>>>> #define DTAB_SIZE (sizeof (dtab) / sizeof (dtab[0]))
>>>>
>>>> static const struct gmp_primes_ptab ptab[] =
>>>> {
>>>> #define WANT_ptab
>>>> #include "trialdivtab.h"
>>>> #undef WANT_ptab
>>>> };
>>>>
>>>> #define PTAB_SIZE (sizeof (ptab) / sizeof (ptab[0]))
>>>>
>>>> void
>>>> mpz_nextprime (mpz_ptr p, mpz_srcptr n)
>>>> {
>>>>   mp_limb_t *moduli;
>>>>   unsigned long difference;
>>>>   int i;
>>>>   unsigned prime_limit;
>>>>   mp_bitcnt_t nbits;
>>>>   unsigned incr;
>>>>   TMP_DECL;
>>>>
>>>>   /* First handle tiny numbers */
>>>>   if (mpz_cmp_ui (n, 2) < 0)
>>>>     {
>>>>       mpz_set_ui (p, 2);
>>>>       return;
>>>>     }
>>>>   mpz_add_ui (p, n, 1);
>>>>   mpz_setbit (p, 0);
>>>>
>>>>   if (mpz_cmp_ui (p, 7) <= 0)
>>>>     return;
>>>>
>>>>   MPN_SIZEINBASE_2EXP(nbits, PTR(p), SIZ(p), 1);
>>>>
>>>>   if (nbits < 200)
>>>>     {
>>>>       prime_limit = nbits / 2;
>>>>       if (SIZ(p) == 1)
>>>>         {
>>>>           /* If the prime list includes any prime >= p, do plain
>>>>              search. */
>>>>           mp_limb_t limb = PTR(p)[0];
>>>>           unsigned last = ptab[prime_limit-1].idx +
>>>> ptab[prime_limit-1].np - 1;
>>>>           if (dtab[last].p >= limb)
>>>>             {
>>>>               while (dtab[last-1].p >= limb)
>>>>                 last--;
>>>>               PTR(p)[0] = dtab[last].p;
>>>>               return;
>>>>             }
>>>>         }
>>>>     }
>>>>   else
>>>>     prime_limit = PTAB_SIZE;
>>>>
>>>>   TMP_MARK;
>>>>
>>>>   /* Compute residues modulo small odd primes */
>>>>   moduli = TMP_ALLOC_TYPE (prime_limit, mp_limb_t);
>>>>
>>>>   for (;;)
>>>>     {
>>>>       for (i = 0; i < prime_limit; i++)
>>>>         {
>>>>           mp_limb_t ppp = ptab[i].ppp;
>>>>           const mp_limb_t *cps = ptab[i].cps;
>>>>           moduli[i] = mpn_mod_1s_4p (PTR(p), SIZ(p), ppp << cps[1],
>>>> cps);
>>>>         }
>>>>
>>>> #define INCR_LIMIT 0x10000      /* deep science */
>>>>
>>>>       for (difference = incr = 0; incr < INCR_LIMIT; difference += 2)
>>>>         {
>>>>           /* First check divisibility based on prime list */
>>>>           for (i = 0; i < prime_limit; i++)
>>>>             {
>>>>               /* FIXME: Ensure that this addition can't overflow */
>>>>               mp_limb_t m = moduli[i] + incr;
>>>>               int idx = ptab[i].idx;
>>>>               int np = ptab[i].np;
>>>>               int j;
>>>>
>>>>               for (j = 0; j < np; j++)
>>>>                 if (m * dtab[idx+j].binv <= dtab[idx+j].lim)
>>>>                   goto next;
>>>>             }
>>>>
>>>>           mpz_add_ui (p, p, difference);
>>>>           difference = 0;
>>>>
>>>>           /* Miller-Rabin test */
>>>>           if (mpz_millerrabin (p, 25))
>>>>             goto done;
>>>>         next:;
>>>>           incr += 2;
>>>>         }
>>>>       mpz_add_ui (p, p, difference);
>>>>       difference = 0;
>>>>     }
>>>>  done:
>>>>   TMP_FREE;
>>>> }
>>>> #else
>>>> static const unsigned char primegap[] =
>>>> {
>>>>   2,2,4,2,4,2,4,6,2,6,4,2,4,6,6,2,6,4,2,6,4,6,8,4,2,4,2,4,14,4,6,
>>>>
>>>> 2,10,2,6,6,4,6,6,2,10,2,4,2,12,12,4,2,4,6,2,10,6,6,6,2,6,4,2,10,14,4,2,
>>>>   4,14,6,10,2,4,6,8,6,6,4,6,8,4,8,10,2,10,2,6,4,6,8,4,2,4,12,8,4,8,4,6,
>>>>
>>>> 12,2,18,6,10,6,6,2,6,10,6,6,2,6,6,4,2,12,10,2,4,6,6,2,12,4,6,8,10,8,10,8,
>>>>
>>>> 6,6,4,8,6,4,8,4,14,10,12,2,10,2,4,2,10,14,4,2,4,14,4,2,4,20,4,8,10,8,4,6,
>>>>   6,14,4,6,6,8,6,12
>>>> };
>>>>
>>>> #define NUMBER_OF_PRIMES 167
>>>>
>>>> void
>>>> mpz_nextprime (mpz_ptr p, mpz_srcptr n)
>>>> {
>>>>   unsigned short *moduli;
>>>>   unsigned long difference;
>>>>   int i;
>>>>   unsigned prime_limit;
>>>>   unsigned long prime;
>>>>   mp_size_t pn;
>>>>   mp_bitcnt_t nbits;
>>>>   unsigned incr;
>>>>   TMP_SDECL;
>>>>
>>>>   /* First handle tiny numbers */
>>>>   if (mpz_cmp_ui (n, 2) < 0)
>>>>     {
>>>>       mpz_set_ui (p, 2);
>>>>       return;
>>>>     }
>>>>   mpz_add_ui (p, n, 1);
>>>>   mpz_setbit (p, 0);
>>>>
>>>>   if (mpz_cmp_ui (p, 7) <= 0)
>>>>     return;
>>>>
>>>>   pn = SIZ(p);
>>>>   MPN_SIZEINBASE_2EXP(nbits, PTR(p), pn, 1);
>>>>   if (nbits / 2 >= NUMBER_OF_PRIMES)
>>>>     prime_limit = NUMBER_OF_PRIMES - 1;
>>>>   else
>>>>     prime_limit = nbits / 2;
>>>>
>>>>   TMP_SMARK;
>>>>
>>>>   /* Compute residues modulo small odd primes */
>>>>   moduli = TMP_SALLOC_TYPE (prime_limit, unsigned short);
>>>>
>>>>   for (;;)
>>>>     {
>>>>       /* FIXME: Compute lazily? */
>>>>       prime = 3;
>>>>       for (i = 0; i < prime_limit; i++)
>>>>         {
>>>>           moduli[i] = mpz_tdiv_ui (p, prime);
>>>>           prime += primegap[i];
>>>>         }
>>>>
>>>> #define INCR_LIMIT 0x10000      /* deep science */
>>>>
>>>>       for (difference = incr = 0; incr < INCR_LIMIT; difference += 2)
>>>>         {
>>>>           /* First check residues */
>>>>           prime = 3;
>>>>           for (i = 0; i < prime_limit; i++)
>>>>             {
>>>>               unsigned r;
>>>>               /* FIXME: Reduce moduli + incr and store back, to allow
>>>> for
>>>>                  division-free reductions.  Alternatively, table
>>>> primes[]'s
>>>>                  inverses (mod 2^16).  */
>>>>               r = (moduli[i] + incr) % prime;
>>>>               prime += primegap[i];
>>>>
>>>>               if (r == 0)
>>>>                 goto next;
>>>>             }
>>>>
>>>>           mpz_add_ui (p, p, difference);
>>>>           difference = 0;
>>>>
>>>>           /* Miller-Rabin test */
>>>>           if (mpz_millerrabin (p, 25))
>>>>             goto done;
>>>>         next:;
>>>>           incr += 2;
>>>>         }
>>>>       mpz_add_ui (p, p, difference);
>>>>       difference = 0;
>>>>     }
>>>>  done:
>>>>   TMP_SFREE;
>>>> }
>>>>
>>>> #endif
>>>>
>>>> --
>>>> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
>>>> Internet email is subject to wholesale government surveillance.
>>>>
>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tune.patch
Type: text/x-patch
Size: 3090 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200308/749494ba/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nextprime_sieve_debug.patch
Type: text/x-patch
Size: 8056 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200308/749494ba/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nextprime_sieve_clean.patch
Type: text/x-patch
Size: 7311 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200308/749494ba/attachment-0005.bin>


More information about the gmp-devel mailing list