GMP used during 3 and a half years to solve MIT's LCS35

Wed May 1 20:56:06 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

  Niall Emmart <nemmart at hotmail.com> writes:

    Maybe the fastest approach would be to use the low level routines to
    do repeated squaring in Montgomery space.

  That would speed things up a lot iff the comutations are performed in an
  invariant ring (or rings).

Some timing observations, using 2048-bit operands (on a Ryzen 2700X):

sqr     0.000271
mul     0.000426
mod     0.000874

Why is mod so much slower than mul?  The time is spent in inverting the
high divisor limb, and then in slightly obsolete, serialised schoolbook
division code.

If we exported invariant divisor Montgomery functions mod should become
only very slightly slower than mul for these operand sizes.

If we exported invariant divisor Euclidean schoolbook division, and
rewrote the loops to avoid serialisation, then that too should make mod
only slightly slower than mul.

The result would be that sqr+mod would run not in 1.15μs but about
0.7μs (for 2048-bit operand on this Ryzen CPU).

(The serialisation in Euclidean schoolbook division is due to our late
quotient limb approximation.  One could typically move it to one full
inner loop iteration earlier, and let it stew in the OoO pipeline while
the inner loop executes for the previous quotient limb approximation.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622