two possible coding porjects
Thu, 31 Jul 2003 11:52:03 -0500 (CDT)

> > accellerate
> > just about any division-related operation against n.
> How?  Division isn't done by trial subtraction, but by a limb
> quotient

Sorry, I wasn't being as careful as I would have been in
a more formal discussion.  First, division per se might
not benefit at all and certainly not enough to justify
building the tableau.  Once it's built, though, you at the
very least have an upshift of n available that has no
"wasted" leading zeros.  Furthermore, when computing squares
mod n you can predict exactly which shift of n to subtract
almost all the time, and you're off by one the rest of the
time.  This should make power-mod computations a lot faster,
unless I'm missing something really important.

The tableau approach suggests itself to me because it would
be a convenient way to accellerate a specific computation
from my upcoming paper, which does involve a subtraction
that I don't think can be eliminated (though it may be
obviated by testing low-order bits in some cases).

When an algorithm expects to compute many modular powers
with the same modulus it seems to me to be worthwhile to
precompute the tableau.  Furthermore, when your limb quotient
is found you still have to do a multiplication and a subtraction
to get the remainder (if it's needed), right?  Again not worth
the candle in ad-hoc computations but if the application commits
to using one particular divisor for a while it might pay off.

> > this would at the very least admit error reporting when
> > non-reallocable storage overflows.
> You mean at the mpn level?  Routines using those must simply
> take care not to go past their defined operands.  Some of the 


The mpz_array_init trick of lying about the allocation
size bothers me a great deal.  I am using it to attach
mpz handles to limb arrays residing in memory mapped
files and used in a read-only fashion and it makes my
hair stand up.  Non-reallocable limb arrays should be
tagged in some explicit way.  I also chafe at the waste
of space involved in using a pointer to a malloc block
to store a single-limb number.  If nothing else it
contributes to fragmentation of the heap and degrades
locality of reference.

I have no doubt that the existing code is reasonably well
thought out.  My suggestions, based on 20+ years of writing
and maintaining code that manipulates pointer-threaded
data structures, were intended to improve the maintainability
and robustness of the code.  What you've got isn't wrong,
it's just brittle -- and brittle is costly.

[Of course, the if the tableau is bigger than the L2 cache
then it might turn out to be a crock, too.  None of this
is easy.  On the Alphas I'm using it might always be faster
to do the shifts from a copy in L1 cache.  On an architecture
with more register pressure and good prefetch bandwidth it
might always be better to build the tableau.]