two possible coding porjects
Sun, 03 Aug 2003 19:36:25 -0500 (CDT)

[I'm not on the gmp-devel list, so if anybody tries to reply
don't edit my name out of the cc line.]

Quoting Kevin Ryde <>:
> writes:
> > When an algorithm expects to compute many modular powers
> > with the same modulus it seems to me to be worthwhile to
> > precompute the tableau.
> Sorry, I don't understand, you'll have to show what you mean.

OK, I grovel in submission.  I knew going into this that you
guys know a heck of a lot more about arithmetic than I do.  I've
become convinced that precomputing the shifts is almost certainly
a waste of time (except perhaps on Intel) and padding the copies
out with zeros would be lunacy.

Actually, I knew that last part already, I had just been thinking
about how I could accellerate my programs without delving into
the mpn layer too deeply.

So, given that the gmp homepage lists as a future feature
- New user-visible mod and division routines for invariant divisors. 
comma, what is the plan for these routines?  I have an algorithm
that does one heck of a lot of GCD and mod against a very large
invariant divisor.

> For a fixed divisor the idea is to pre-compute an inverse
> (either an approximation to 1/d, or a mod 2^n for REDC).
> That turns a division or mod into basically a high-half 
> and a low-half multiply.

Uhm, I'm doing GCD and modulus against a number that has something
like 10,000 bits.  A lot of them.  There's not a suitable ring in
which to precompute the ... well, maybe there is!

I could always work in a convenient power of two.  Hmm.
That would mean dividing out any powers of two in the
numerator, right?  [need to look that up.  What happens
when you compute a/b mod n and (a,n)!=1?]  Not a big deal,
I can do that before taking the powers and multiply it back.

> > The mpz_array_init trick of lying about the allocation
> > size bothers me a great deal.
> mpz_array_init is a highly specialized function, not
> recommended for normal use.

Then update the documentation so it says that.  The docs make
it obvious that it's special but give no indication of what
that special use might be nor how a user is to know whether
his program is using it correctly.

A perfect example of the brittleness I was talking about.

> Unfortunately overheads on small values are more or less
> inevitable in a general purpose variable precision library.

True.  Doesn't mean you have to squander resources.  Of course,
single-limb quantities don't arise often as results of arbitrary
integer computations but I'm on a 64-bit machine so anything
with a FP result would fit, as would numerator and/or denominator
of many rational numbers.  (actually, one and zero come up
fairly often.  EG, GCD(rand,rand)=1 most of the time.)

> > If nothing else it
> > contributes to fragmentation of the heap and degrades
> > locality of reference.
> Allocation strategy can at least be controlled by the
> application through setting different memory functions.

Again something that could use better documentation, but I
can always read the code.  For hard-realtime apps I usually
pick a few block sizes and maintain freelists by block size,
then keep a best-fit pool for big oddballs.  [Yes. I've used
dynamic memory allocation in hard-realtime, safety-critical
systems.  Makes me all tingly just thinking about it.]

I'd just as soon not have to think about memory allocation
when writing number theory code.  On the other hand, once
I know how and why realloc is getting called in my app I can
design an allocation strategy to make it stop.  For working
variables I may as well simply preallocate them to the size
of the invariant modulus; as long as the values are flatly
distributed in the ring that doesn't waste anything.

A good buddy-block allocator would detect free/realloc on
non-allocated pointers and reduce the per-block storage overhead
to a few bits.  It could probably be tuned to make most reallocs
free, at the cost of wasting a lot of the L2 cache.  The
direct-mapped L2 cache on the Alpha is something you've got
to keep in mind all the time when you're living with one.  I've
got three in my living room.  Even if you don't have a direct
mapped cache, any unused (or unneceaary) bytes adjacent to a value
in the cache wastes space in that cache line that could have
something valuable in it.

I know you know all this, but I get the feeling you guys focus more
on the processor pipelines than on the memory hierarchy.  For
many applications this is appropriate, since you're working with
a handful of variables that all fit in L2 if not L1.  I've filling
all available temporary and swap space with numbers, all of which
get visited as it runs.  I have to think about scheduling the page
faults and the cache fills.

Sorry if any of the above reads harshly, I've been studying
for an exam for several days and my butt is sore.  I remain
terribly grateful for the impressive work that is gmp.