mpz_prevprime

Sat Mar 7 00:03:59 UTC 2020

Some more justification for my patch

I wrote some code that replaces mpz_millerrabin with a static lookup and
ran it for a number of values.

Existing Code (with real mpz_millerrabin)
100, gap: 1533, 0.00164 seconds
200, gap: 2547, 0.00675 seconds
300, gap: 3757, 0.01161 seconds
500, gap: 3765, 0.02568 seconds
1000, gap: 20835, 0.67636 seconds
2000, gap: 23685, 6.03032 seconds
5000, gap: 58867, 167.65375 seconds

With is_prime_lookup (e.g. nearly instant mpz_millerrabin)
100, gap: 1533, 0.00018 seconds
200, gap: 2547, 0.00045 seconds
300, gap: 3757, 0.00087 seconds
500, gap: 3765, 0.00085 seconds
1000, gap: 27393, 0.00580 seconds
2000, gap: 27711, 0.00557 seconds
5000, gap: 74443, 0.01320 seconds

Code in nextprime_segment.patch (see previous message) with is_prime_lookup
100, gap: 1533, 0.00011 seconds
200, gap: 2547, 0.00015 seconds
300, gap: 3757, 0.00018 seconds
500, gap: 3765, 0.00023 seconds
1000, gap: 27393, 0.00061 seconds
2000, gap: 27711, 0.00070 seconds
5000, gap: 74443, 0.00198 seconds

So the overhead from the existing code is less than
6% at 200 bits
3% at 500 bits
1% at 1000 bits.

The new code improves this to
2% at 200 bits
1% at 500 bits
0.1% at 1000 bits

I'm going to use this program to tune a new prime_limit on the range
300-5000 bits which is harder to tune with tune/speed.

On Fri, Mar 6, 2020 at 1:55 PM Seth Troisi <braintwo at gmail.com> wrote:

> I tried using a segmented sieve. this avoids multiplication or mod in the
> inner loop.
> It's an improvement and the code is easy to comprehend.
>
> $ ./tune/speed -s 100-300,500,1000 -f 1.1 mpz_nextprime -p mpz_nextprime
> $ hg import ~/Downloads/nextprime_segment.patch --no-commit
> $ make -j4; make -C tune/ speed -j4
> $ ./tune/speed -s 100-300,500,1000 -f 1.1 mpz_nextprime -P mpz_nextprime2
>
> $ pr -m -t mpz_nextprime.data mpz_nextprime2.data | awk '{ print
> $0,"\t"$4/$2 }'
> 100 3.658593750e-05    100     3.326367187e-05 0.909193
> 110 3.904296875e-05    110     3.598828125e-05 0.921761
> 121 4.675585937e-05    121     4.157031250e-05 0.889093
> 133 8.379882812e-05    133     7.977343750e-05 0.951964
> 146 9.650195312e-05    146     8.803710937e-05 0.912283
> 160 1.245546875e-04    160     1.032539062e-04 0.828985
> 176 1.269101562e-04    176     1.173007813e-04 0.924282
> 193 1.977187500e-04    193     1.846230469e-04 0.933766
> 212 2.087812500e-04    212     1.933378906e-04 0.926031
> 233 2.356660156e-04    233     2.218164063e-04 0.941232
> 256 3.030214844e-04    256     2.868691406e-04 0.946696
> 281 4.203593750e-04    281     3.908398437e-04 0.929775
> 500 2.132431641e-03    500     2.013937500e-03 0.944432
> 1000 2.307201758e-02    1000     2.284832031e-02 0.990304
>
> This can undoubtedly benefit from a larger sieve also,
> which I'm testing in a different patch.
>
> On Thu, Mar 5, 2020 at 6:05 PM Seth Troisi <braintwo at gmail.com> wrote:
>
>> I tried using a segmented sieve. this avoids multiplication or mod in the
>> inner loop.
>> It's an improvement and the code is easy to comprehend.
>>
>>
>>
>>
>> On Wed, Feb 12, 2020 at 10:25 AM Niels Möller <nisse at lysator.liu.se>
>> wrote:
>>
>>> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
>>>
>>> > Ciao,
>>> >
>>> > Il Mer, 12 Febbraio 2020 7:26 am, Niels Möller ha scritto:
>>> >> And for a small prime gap (common case), this cost should be the main
>>> >> part of the sieving. So it would make sense to optimize it. If we also
>>> >> use the ptab, we could compute modulo single-limb prime products using
>>> >> precomputed constants.
>>> >
>>> > Yes, of course.
>>>
>>> Another version below, using ptab. Timing:
>>>
>>> $ ./tune/speed -s 100-300,500,1000 -f 1.1 mpz_nextprime
>>>         mpz_nextprime
>>> 100       0.000116762
>>> 110       0.000126770
>>> 121       0.000147203
>>> 133       0.000257994
>>> 146       0.000291051
>>> 160       0.000343168
>>> 176       0.000386605
>>> 193       0.000625994
>>> 212       0.000670035
>>> 233       0.000763240
>>> 256       0.000942154
>>> 281       0.001418639
>>> 500       0.006632588
>>> 1000      0.073935992
>>>
>>> So also close to 15% speedup for 100 bits, and the same as previous
>>> version, almost 25%, for 1000 bits.
>>>
>>> I used the very crude tuning
>>>
>>>   limit = (nbits < 200) ? nbits / 2 : PTAB_SIZE;
>>>
>>> to avoid regressing for smaller sizes.
>>>
>>> > And the next question will be: should we generate on the fly an even
>>> > larger table of primes when the required size grows beyond the
>>> > precomputed table?
>>>
>>> Don't know. I don't quite understand how large a table could be
>>> beneficial, but I guess it might make sense to at least limit it to size
>>> of L1 cache.
>>>
>>> Regards,
>>> /Niels
>>>
>>> /* mpz_nextprime(p,t) - compute the next prime > t and store that in p.
>>>
>>> Copyright 1999-2001, 2008, 2009, 2012 Free Software Foundation, Inc.
>>>
>>> Contributed to the GNU project by Niels Möller and Torbjorn Granlund.
>>>
>>> This file is part of the GNU MP Library.
>>>
>>> The GNU MP Library is free software; you can redistribute it and/or
>>> modify
>>> it under the terms of either:
>>>
>>>   * the GNU Lesser General Public License as published by the Free
>>>     Software Foundation; either version 3 of the License, or (at your
>>>     option) any later version.
>>>
>>> or
>>>
>>>   * the GNU General Public License as published by the Free Software
>>>     Foundation; either version 2 of the License, or (at your option) any
>>>     later version.
>>>
>>> or both in parallel, as here.
>>>
>>> The GNU MP Library is distributed in the hope that it will be useful, but
>>> WITHOUT ANY WARRANTY; without even the implied warranty of
>>> MERCHANTABILITY
>>> or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>>> for more details.
>>>
>>> You should have received copies of the GNU General Public License and the
>>> GNU Lesser General Public License along with the GNU MP Library.  If not,
>>> see https://www.gnu.org/licenses/.  */
>>>
>>> #include "gmp-impl.h"
>>> #include "longlong.h"
>>>
>>> #if 1
>>> struct gmp_primes_dtab {
>>>   mp_limb_t p;
>>>   mp_limb_t binv;
>>>   mp_limb_t lim;
>>> };
>>>
>>> struct gmp_primes_ptab {
>>>   mp_limb_t ppp;        /* primes, multiplied together */
>>>   mp_limb_t cps[7];     /* ppp values pre-computed for mpn_mod_1s_4p */
>>>   gmp_uint_least32_t idx:24;    /* index of  first primes in dtab */
>>>   gmp_uint_least32_t np :8;     /* number of primes related to this
>>> entry */
>>> };
>>>
>>> static const struct gmp_primes_dtab dtab[] =
>>> {
>>> #define WANT_dtab
>>> #define P(p,inv,lim) {p, inv,lim}
>>> #include "trialdivtab.h"
>>> #undef WANT_dtab
>>> #undef P
>>> };
>>>
>>> #define DTAB_SIZE (sizeof (dtab) / sizeof (dtab[0]))
>>>
>>> static const struct gmp_primes_ptab ptab[] =
>>> {
>>> #define WANT_ptab
>>> #include "trialdivtab.h"
>>> #undef WANT_ptab
>>> };
>>>
>>> #define PTAB_SIZE (sizeof (ptab) / sizeof (ptab[0]))
>>>
>>> void
>>> mpz_nextprime (mpz_ptr p, mpz_srcptr n)
>>> {
>>>   mp_limb_t *moduli;
>>>   unsigned long difference;
>>>   int i;
>>>   unsigned prime_limit;
>>>   mp_bitcnt_t nbits;
>>>   unsigned incr;
>>>   TMP_DECL;
>>>
>>>   /* First handle tiny numbers */
>>>   if (mpz_cmp_ui (n, 2) < 0)
>>>     {
>>>       mpz_set_ui (p, 2);
>>>       return;
>>>     }
>>>   mpz_add_ui (p, n, 1);
>>>   mpz_setbit (p, 0);
>>>
>>>   if (mpz_cmp_ui (p, 7) <= 0)
>>>     return;
>>>
>>>   MPN_SIZEINBASE_2EXP(nbits, PTR(p), SIZ(p), 1);
>>>
>>>   if (nbits < 200)
>>>     {
>>>       prime_limit = nbits / 2;
>>>       if (SIZ(p) == 1)
>>>         {
>>>           /* If the prime list includes any prime >= p, do plain
>>>              search. */
>>>           mp_limb_t limb = PTR(p)[0];
>>>           unsigned last = ptab[prime_limit-1].idx +
>>> ptab[prime_limit-1].np - 1;
>>>           if (dtab[last].p >= limb)
>>>             {
>>>               while (dtab[last-1].p >= limb)
>>>                 last--;
>>>               PTR(p)[0] = dtab[last].p;
>>>               return;
>>>             }
>>>         }
>>>     }
>>>   else
>>>     prime_limit = PTAB_SIZE;
>>>
>>>   TMP_MARK;
>>>
>>>   /* Compute residues modulo small odd primes */
>>>   moduli = TMP_ALLOC_TYPE (prime_limit, mp_limb_t);
>>>
>>>   for (;;)
>>>     {
>>>       for (i = 0; i < prime_limit; i++)
>>>         {
>>>           mp_limb_t ppp = ptab[i].ppp;
>>>           const mp_limb_t *cps = ptab[i].cps;
>>>           moduli[i] = mpn_mod_1s_4p (PTR(p), SIZ(p), ppp << cps[1], cps);
>>>         }
>>>
>>> #define INCR_LIMIT 0x10000      /* deep science */
>>>
>>>       for (difference = incr = 0; incr < INCR_LIMIT; difference += 2)
>>>         {
>>>           /* First check divisibility based on prime list */
>>>           for (i = 0; i < prime_limit; i++)
>>>             {
>>>               /* FIXME: Ensure that this addition can't overflow */
>>>               mp_limb_t m = moduli[i] + incr;
>>>               int idx = ptab[i].idx;
>>>               int np = ptab[i].np;
>>>               int j;
>>>
>>>               for (j = 0; j < np; j++)
>>>                 if (m * dtab[idx+j].binv <= dtab[idx+j].lim)
>>>                   goto next;
>>>             }
>>>
>>>           mpz_add_ui (p, p, difference);
>>>           difference = 0;
>>>
>>>           /* Miller-Rabin test */
>>>           if (mpz_millerrabin (p, 25))
>>>             goto done;
>>>         next:;
>>>           incr += 2;
>>>         }
>>>       mpz_add_ui (p, p, difference);
>>>       difference = 0;
>>>     }
>>>  done:
>>>   TMP_FREE;
>>> }
>>> #else
>>> static const unsigned char primegap[] =
>>> {
>>>   2,2,4,2,4,2,4,6,2,6,4,2,4,6,6,2,6,4,2,6,4,6,8,4,2,4,2,4,14,4,6,
>>>   2,10,2,6,6,4,6,6,2,10,2,4,2,12,12,4,2,4,6,2,10,6,6,6,2,6,4,2,10,14,4,2,
>>>   4,14,6,10,2,4,6,8,6,6,4,6,8,4,8,10,2,10,2,6,4,6,8,4,2,4,12,8,4,8,4,6,
>>>
>>> 12,2,18,6,10,6,6,2,6,10,6,6,2,6,6,4,2,12,10,2,4,6,6,2,12,4,6,8,10,8,10,8,
>>>
>>> 6,6,4,8,6,4,8,4,14,10,12,2,10,2,4,2,10,14,4,2,4,14,4,2,4,20,4,8,10,8,4,6,
>>>   6,14,4,6,6,8,6,12
>>> };
>>>
>>> #define NUMBER_OF_PRIMES 167
>>>
>>> void
>>> mpz_nextprime (mpz_ptr p, mpz_srcptr n)
>>> {
>>>   unsigned short *moduli;
>>>   unsigned long difference;
>>>   int i;
>>>   unsigned prime_limit;
>>>   unsigned long prime;
>>>   mp_size_t pn;
>>>   mp_bitcnt_t nbits;
>>>   unsigned incr;
>>>   TMP_SDECL;
>>>
>>>   /* First handle tiny numbers */
>>>   if (mpz_cmp_ui (n, 2) < 0)
>>>     {
>>>       mpz_set_ui (p, 2);
>>>       return;
>>>     }
>>>   mpz_add_ui (p, n, 1);
>>>   mpz_setbit (p, 0);
>>>
>>>   if (mpz_cmp_ui (p, 7) <= 0)
>>>     return;
>>>
>>>   pn = SIZ(p);
>>>   MPN_SIZEINBASE_2EXP(nbits, PTR(p), pn, 1);
>>>   if (nbits / 2 >= NUMBER_OF_PRIMES)
>>>     prime_limit = NUMBER_OF_PRIMES - 1;
>>>   else
>>>     prime_limit = nbits / 2;
>>>
>>>   TMP_SMARK;
>>>
>>>   /* Compute residues modulo small odd primes */
>>>   moduli = TMP_SALLOC_TYPE (prime_limit, unsigned short);
>>>
>>>   for (;;)
>>>     {
>>>       /* FIXME: Compute lazily? */
>>>       prime = 3;
>>>       for (i = 0; i < prime_limit; i++)
>>>         {
>>>           moduli[i] = mpz_tdiv_ui (p, prime);
>>>           prime += primegap[i];
>>>         }
>>>
>>> #define INCR_LIMIT 0x10000      /* deep science */
>>>
>>>       for (difference = incr = 0; incr < INCR_LIMIT; difference += 2)
>>>         {
>>>           /* First check residues */
>>>           prime = 3;
>>>           for (i = 0; i < prime_limit; i++)
>>>             {
>>>               unsigned r;
>>>               /* FIXME: Reduce moduli + incr and store back, to allow for
>>>                  division-free reductions.  Alternatively, table
>>> primes[]'s
>>>                  inverses (mod 2^16).  */
>>>               r = (moduli[i] + incr) % prime;
>>>               prime += primegap[i];
>>>
>>>               if (r == 0)
>>>                 goto next;
>>>             }
>>>
>>>           mpz_add_ui (p, p, difference);
>>>           difference = 0;
>>>
>>>           /* Miller-Rabin test */
>>>           if (mpz_millerrabin (p, 25))
>>>             goto done;
>>>         next:;
>>>           incr += 2;
>>>         }
>>>       mpz_add_ui (p, p, difference);
>>>       difference = 0;
>>>     }
>>>  done:
>>>   TMP_SFREE;
>>> }
>>>
>>> #endif
>>>
>>> --
>>> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
>>> Internet email is subject to wholesale government surveillance.
>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fast_test_orig.patch
Type: text/x-patch
Size: 5309 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20200306/5efb2dab/attachment-0001.bin>