Mod operation

Tue Dec 1 14:31:19 CET 2009

nisse at lysator.liu.se (Niels Möller) writes:

> To guess whether or not it can be a saving to do half the work using
> m3 and a 2n x n multiplication, it would be useful to know which is
> fastest, a 2nxn multiplication or current redc_n (on size n).

I have added speed support for redc_n and toom42_mul, and according to
speed on my machine toom42 wins (./speed -s 500 mpn_redc_n, ./speed -s
1000 mpn_toom42_mul).

So lets consider a "bipartite" redc, reducing x (4n limbs) by m (2n
limbs). The most significant n limbs is eliminated using m3 and a 2n x n
multiply. Then the least significant n limbs are eliminated using an
nxn mullo to get a quotient, and a 2n mulmod_bnm1 (using wraparound,
just like redc_n).

Consider n = 500. This is what I get on may laptop (x86_32):

  mpn_toom42_mul	1000      0.001056473
  mpn_mullow_n           500      0.000491507
  mpn_mulmod_bnm1       1000      0.000952538
----------------------------------------------
                                  0.002500518

To compare with

  mpn_redc_n            1000      0.008734385

These numbers don't quite make sense to me, it looks too good to be
true. Compare to what redc_n should cost, using the same primitives,

  mpn_mullow_n          1000      0.001369270
  mpn_mulmod_bnm1       1000      0.000952538
----------------------------------------------
                                  0.002321808

These figures are somewhat easier to believe, and they indicate that
current redc_n should still win. But I don't understand why ./speed
mpn_redc_n gives so poor results. Maybe benchmarking on this laptop is
generally unreliable.

With this breakdown of the operations, the battle is between mullo of
size 2n and (mullo of size n + 2n x n multiply).

Regards,
/Niels