faster modular arithmetic?
Jason Moxham
j.l.moxham@maths.soton.ac.uk
Sat, 12 Jul 2003 21:13:53 +0100
very roughly... ( we rely on a*b mod 2^k+-1 is faster than a*b mod 2^k is
faster than a*b)
given a fixed modulas m we wish to calculate x%m for many x
using typical barrett method(simplified)
one off part
choose R1,R2 such that 3<=m<=x<R1 and 2m<R2
let Inv=floor(R1/m)
for each x part
1) let q=floor(Inv*x/R1) // floor(x/m)=q or q+1
2) let r=x%R2 - (q*m%R2) // this is >0
while( r > m) r=r-m // loop at most twice
then r=x%m
where normally we choose R1=2^n=R2/2 for some suitible n (depending on our
x's)
clearly a modular reduction in this case can be done with two full
multiplications , as is well known step 1 and 2 amount to a high part and low
part multiplication which are faster than a full mul but only for small
sizes(say <4000 limbs)
by changing R2=2^(n+1)+1 or R2=2^(n+1)-1 , then we still get the same result
but we can calculate q*m%R2 faster than q*m , for the plus case use gmp FFT
mod 2^k+1 , and for minus case we use modular multiplication(like
Knuth(factorize R2))
Of course n,k have to be chosen carefully , but experiments seem to indicate
that a preliminary version of this is faster than a well optimized high/low
part barrett division for a "largish" range of limb sizes
? can we calculate floor(Inv*x/(2^k+-1)) faster than a high part mul (which is
faster than a mul) for some ranges
The same approach applys to Montgomery reduction (simplified)
but here R1 must be equal to R2
test gcd(m,R1)==1
let Inv= -m^-1 mod R1
1)let q=x*Inv%R1
2)let r=(x+m*q)/R2 // exact div (as R1==R2)
if r>m then r=r-m
then r==xR1^-1 mod m
typically we use R1=2^n for suitible n and so get step 1 to be a low part mul
and step 2 to be a high part mul (+1 if R2|x)
as above we can change R1 to 2^n+-1 , and this can speed up step 1 (faster
than lowpart mul)
a few problems remain ,
how to test gcd(m,2^n+1 or 2^n-1)=1 fast ? ( may not be too important at the
sizes where x*m%(2^n+1) is faster than x*m%2^n )
exact div by 2^n+1 is easy , but can we get a faster verision ie calculate
x+m*q/R2 directly rather than use a full mul ( eg like when R2=2^n then we
can use a high half mul)
perhaps something like if r<2^n then (x+m*q)/(2^n+1)= (x+m*q)%2^n = r ( so
can use a low part mul)
Its all bit sketchy , but it seems promising enough to continue
jason