fixed-size mpn_mul_n for small n?

Mon Feb 13 13:12:03 CET 2012

Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:

> I believe those sizes are too small. With a full inverse, we need that two
> short products (one mullo and one mulhi) are faster than redc_1 or
> redc_2. This does not seem to be the case for 14 limbs, where mpn_redc_n takes
> 0.45us, compared to 0.23us for mpn_redc_1:

I see.

>> And back to the original question: I guess you could try to completely
>> unroll the multiplication for some size of interest, and compare to the
>> general mpn_mul_basecase.

I think there's some potential for speed up of the linear term, which is
mostly relevant for small sizes. The addmul_1 calls can run at 3 cycles
per limb or so. But then the computing the quotient involves dependent
multiplications with longer latency, so one may be able to compute the
independent left and right quotient in less time than computing two
quotients at the same end. Unclear to me if that's going to make any
difference in real code, in particular since then left-to-right quotient
will require some kind of adjustment step.

> ok. Any speedup in squaring (mpn_sqr) is also most welcome, since our best
> ECM Stage 1 code performs 4 modular multiplications and 4 modular squaring
> per iteration.

What sizes are important?

I have started to look a little into elliptic curve cryptography, and
there the sizes are pretty small. E.g., Using the standard curve over a
256-bit prime means that numbers are just four limbs on a 64-bit
machine. So in this case, I'd expect a specialized a completely unrolled
squaring function for this size could make a real difference.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.