faster modular arithmetic?

Jason Moxham
Sat, 12 Jul 2003 21:13:53 +0100

very roughly... ( we rely on a*b mod 2^k+-1 is faster than a*b mod 2^k is 
faster than a*b)

given a fixed modulas m we wish to calculate x%m for many x

using typical barrett method(simplified)

one off part
	choose  R1,R2 such that 3<=m<=x<R1 and 2m<R2
	let Inv=floor(R1/m)

for each x part
1)	let q=floor(Inv*x/R1)  // floor(x/m)=q or q+1
2)	let r=x%R2 - (q*m%R2)	// this is >0
	while( r > m) r=r-m	// loop at most twice

then r=x%m

where normally we choose R1=2^n=R2/2 for some suitible n (depending on our 

clearly a modular reduction in this case can be done with two full 
multiplications , as is well known step 1 and 2 amount to a high part and low 
part multiplication which are faster than a full mul but only for small 
sizes(say <4000 limbs)
by changing R2=2^(n+1)+1 or R2=2^(n+1)-1 , then we still get the same result 
but we can calculate q*m%R2 faster than q*m , for the plus case use gmp FFT 
mod 2^k+1 , and for minus case we use modular multiplication(like 
Knuth(factorize R2))
Of course n,k have to be chosen carefully , but experiments seem to indicate 
that a preliminary version of this is faster than a well optimized high/low 
part barrett division for a "largish" range of limb sizes

? can we calculate floor(Inv*x/(2^k+-1)) faster than a high part mul (which is 
faster than a mul) for some ranges 

The same approach applys to Montgomery reduction (simplified)
but here R1 must be equal to R2

test gcd(m,R1)==1
let Inv= -m^-1 mod R1
1)let q=x*Inv%R1
2)let r=(x+m*q)/R2  // exact div (as R1==R2)
if r>m then r=r-m

then r==xR1^-1 mod m

typically we use R1=2^n for suitible n and so get step 1 to be a low part mul 
and step 2 to be a high part mul (+1 if R2|x)

as above we can change R1 to 2^n+-1 , and this can speed up step 1 (faster 
than lowpart mul) 

a few problems remain , 

how to test gcd(m,2^n+1 or 2^n-1)=1 fast ? ( may not be too important at the 
sizes where x*m%(2^n+1) is faster than x*m%2^n )

exact div by 2^n+1 is easy , but can we get a faster verision ie calculate 
x+m*q/R2 directly rather than use a full mul ( eg like when R2=2^n then we 
can use a high half mul) 
perhaps something like if r<2^n then (x+m*q)/(2^n+1)= (x+m*q)%2^n = r  ( so 
can use a low part mul) 

Its all bit sketchy , but it seems promising enough to continue