fixed-size mpn_mul_n for small n?

Sun Feb 12 20:29:51 CET 2012

       Niels,

> For these moderate sizes, does it pay off to precompute a full inverse
> for the montgomery reduction, rather than using redc_1 or redc_2? (I
> don't quite understand which lines in you benchmark data I should look
> at).

I believe those sizes are too small. With a full inverse, we need that two
short products (one mullo and one mulhi) are faster than redc_1 or redc_2.
This does not seem to be the case for 14 limbs, where mpn_redc_n takes
0.45us, compared to 0.23us for mpn_redc_1:

Time in microseconds per call, size=14
mpn_mul_n  = 0.179211
mpn_sqr    = 0.123209
mpn_redc_1 = 0.229614
mpn_redc_2 = 0.235214
ecm_redc3  = 0.274418 # this is GMP-ECM variable-size assembly redc
mpn_redc_n = 0.448028
mulredc    = 0.431227 # this is GMP-ECM assembly combined mul+redc
mul+redc_1 = 0.420027
mul+redc_2 = 0.414426
mul+redc3  = 0.464830
mul+redc_n = 0.627240
sqr+redc_1 = 0.358423
sqr+redc_2 = 0.352823
sqr+redc3  = 0.403226
sqr+redc_n = 0.554434

Legend: "mul" means mpn_mul_n, "sqr" means mpn_sqr

> Have you tried the "bidirectional" trick we discussed a while ago (IIRC
> it was your idea): For size n, instead of the standard montgomery
> representation x' = B^n x (mod m), use the representation x' = B^{n/2} x (mod
> m), and each time a size 2n product is to be reduced, cancel n/2
> limbs from the right, and n/2 from the left? For the _1 version, it
> might even make sense to make a single loop working from both ends.

no I didn't. But it doesn't save any computation, since to cancel n/2 limbs,
you need n/2 addmul_1 calls of size n, thus in total n addmul_1 calls of size
n. The only benefit is when using 2 threads reducing from the left and right.

> And back to the original question: I guess you could try to completely
> unroll the multiplication for some size of interest, and compare to the
> general mpn_mul_basecase.

I know nothing about assembly, this is why I asked on this list :-)

> My understanding is that current branch predictors are fairly good at
> the case of a loop which always runs for the same number of iterations,
> so the loop overhead (even without unrolling) shouldn't be severe.
> Totally unrolled code might have a greater potential for speedups for
> squaring, mullo and mulhi, than for mul_basecase, since the former
> probably are less friendly to the branch prediction.

ok. Any speedup in squaring (mpn_sqr) is also most welcome, since our best
ECM Stage 1 code performs 4 modular multiplications and 4 modular squaring
per iteration.

Paul