fixed-size mpn_mul_n for small n?

Mon Feb 13 19:00:43 CET 2012

       Niels,

> > agreed. But to avoid this dependent multiplication, I believe
> > Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls
> > with N, you perform n-1 call with k*N such that the low limb of k*N is -1
> > (thus the quotient selection is trivial) and one call with N.
> 
> Ah, that's clever. I wasn't aware of that method.

the subquadratic version is described in [1], Algorithm 2.8. However I only
recently figured out that with (kN+1)/B precomputed, you only need an addmul_1
of length n (instead of n+1) at each step for the quadratic version.

> > for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200
> > to 400 decimal digits.

for RSA signature and encryption, 16 and 32 limbs are important targets too.

> Using completely unrolled code for all these sizes (I'm now primarily
> thinking of squaring) seems a bit impractical. Maybe one could do
> something reasonable with specialcase code for collecting the
> off-diagonal terms (long code with jumpts into), and then a plain loop
> for the diagonal and the final shift + add.

I found a variant, but I'm not sure it is better:

1) first put the diagonal terms in place (this will fill exactly the 2n buffer)
2) divide by 2 (if the input is odd, store the carry out)
3) accumulate the off-diagonal terms (could be done in assembly as you suggest)
4) multiply by 2 (and restore the carry out)
5) perform the usual reduction

Paul

[1] Modern Computer Arithmetic, Richard Brent and Paul Zimmermann, Cambridge University Press, 2010, available online.