Improvement to mpz_nextprime

Thu Jun 22 17:31:11 CEST 2006

On Jun 22, 2006, at 9:33 AM, Torbjorn Granlund wrote:

>> For example:
>>
>> * The prime
>> * An inverse of the prime and some shift constants for fast division
>>   by the prime
>
>   I assume you mean either something like Barrett division or the
>   algorithm of your paper with Peter Montgomery. In either case,
>   the  values to be stored are precision-dependent, no? Unless one
>   were  looking for a series of primes of the same size (and
>   usually this  means in the same interval, so that one would be
>   compelled to write  his own sieve in that case), it wouldn't make
>   sense to store these  constants -- and, in particular, I
>   understand it wouldn't make sense  to use the algorithm for
>   division by constants.
>
> Why?
>
> Here and there in GMP, we now divide by prime small numbers, an
> operation that takes on the order of 200 plain operations on a modern
> CPU.
>
> I really think it makes sense to trade division for a multiply and
> some plain operations.  Why do you say it doesn't make sense?

I may be misunderstanding the paper (just to be clear: `Division by  
invariant integers using multiplication', Torbjorn Granlund and Peter  
Montgomery), but here's a quotation: `The algorithms are ineffective  
when a divisor is not invariant'. The divisors are invariant indeed,  
but I understand the precomputations are divisor and *precision*- 
dependent -- i.e. if the parameter N (using the notation from the  
paper) changes, then the constants would have to be recomputed.  
Perhaps you're assuming that, if many consecutive calls are made to  
mpz_nextprime(), it's likely that the arguments are of the same size,  
so that precomputations could be reused from one computation to the  
other?

By the way, if you're worried with that case, why not offer an extra  
function, something like mpz_nextnprimes(rop[], op, n), where the  
next n primes from op are found and stored in rop[]?

> The code for initializing the residues will want to do a lot of
> dividing, and here tabled preinverses will help a lot.

By tabled preinverses you mean the constants used in Barrett  
division, right? My concern in this case (and it's probably valid for  
the method from your paper as well) is that these constants are  
computed to a large precision, as large as the argument to  
mpz_nextprime(). So if we're searching for 1k-bit primes, we'd need a  
table with 1k-bit entries per prime, so that's 78 Mbit or 10 MB to  
store a table of primes up to 1 million. I believe that raises memory  
usage and locality concerns, and it severely limits the amount of  
primes to use in the sieve. In that light, I think Bernstein's tree- 
product method looks better. Unless of course I'm mistaken and  
there's a more space-efficient way of storing pre-computed inverses.

>> A prime can be represented with a byte, representing the distance  
>> from
>> the previous prime.  (Since distances are even between odd primes, we
>> can represent a huge table that way, limited by 304599508537...)
>
>   Indeed, and this representation is more efficient than a simple
>   bitmap (no wheel) once primes are larger than about 8k, if I'm
>   not  mistaken (using the aproximation n/(log(n) - 1) for the
>   prime number  theorem). But anyway, once a wheel is implemented,
>   even a fairly  simple wheel, representation by differences would
>   lose on all  practical ranges. Of course, if a sizable amount of
>   information has  to be represented besides the prime itself, then
>   it doesn't really  matter.
>
> What matters still is access times.  Both a bitmap and deltas have
> their pros and cons.

Deltas are surely easier to access, but I think an algorithm would be  
bottlenecked by the calculation of starting points anyway, so it  
makes no difference.

In fact, I was starting to think it might be better to write a sieve  
to produce the small primes (2, 3, 5, 7, ...) and use them in the  
main sieve. Storage problems go away this way.

>
>   Truth be told, even a meager 8 KB L1 D-cache such as the P4's
>   would  be enough to sieve expected ranges for 28k-digit
>   primes. Currently,  at this size one is much better off using
>   PFGW, as I recall. Looking  forward, a 32 KB L1 D-cache is a
>   reasonable assumption, enough for  111k-digit primes. If you
>   think that's not enough, then the sieve  array itself could use a
>   wheel, although things would complicate a  little bit; a 2*3*5*7
>   = 210 wheel combined with a 32 KB L1 D-cache  would be good for
>   ~500k-digit primes.
>
> I'd say to use a 1 Kibyte sieve, or smaller if the range suggests
> that.  If one doesn't find a prime, one just needs to rerun the sieve
> for the next block.  (To make sure that mechanism works, test it with
> a one byte sieve.)

That's a good idea, as long as the starting point computation is the  
bottleneck of the sieve. What worries me is the storage of residues  
to start the next computation. If a reasonable amount of primes is  
used, this is likely to overflow even L2 cache (e.g. there's 78k  
primes up to 1 million, times 4 bytes for each, amounts to 312 KB;  
and that would be a small amount of primes for a realistic sieve).  
Then again, access pattern for this array would be sequential, and  
hardware prefetchers in the CPU should ensure they're available when  
needed.

>
>   Actually, I rewrote everything (hey, I was bored...)
>
> Rewrote as compared to what?

I actually meant that I wrote everything from scratch, instead of  
reusing the code that you sent.

Décio

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 186 bytes
Desc: This is a digitally signed message part
Url : http://gmplib.org/list-archives/gmp-discuss/attachments/20060622/edcf2f03/PGP.bin