Mod operation
Niels Möller
nisse at lysator.liu.se
Tue Dec 1 14:31:19 CET 2009
nisse at lysator.liu.se (Niels Möller) writes:
> To guess whether or not it can be a saving to do half the work using
> m3 and a 2n x n multiplication, it would be useful to know which is
> fastest, a 2nxn multiplication or current redc_n (on size n).
I have added speed support for redc_n and toom42_mul, and according to
speed on my machine toom42 wins (./speed -s 500 mpn_redc_n, ./speed -s
1000 mpn_toom42_mul).
So lets consider a "bipartite" redc, reducing x (4n limbs) by m (2n
limbs). The most significant n limbs is eliminated using m3 and a 2n x n
multiply. Then the least significant n limbs are eliminated using an
nxn mullo to get a quotient, and a 2n mulmod_bnm1 (using wraparound,
just like redc_n).
Consider n = 500. This is what I get on may laptop (x86_32):
mpn_toom42_mul 1000 0.001056473
mpn_mullow_n 500 0.000491507
mpn_mulmod_bnm1 1000 0.000952538
----------------------------------------------
0.002500518
To compare with
mpn_redc_n 1000 0.008734385
These numbers don't quite make sense to me, it looks too good to be
true. Compare to what redc_n should cost, using the same primitives,
mpn_mullow_n 1000 0.001369270
mpn_mulmod_bnm1 1000 0.000952538
----------------------------------------------
0.002321808
These figures are somewhat easier to believe, and they indicate that
current redc_n should still win. But I don't understand why ./speed
mpn_redc_n gives so poor results. Maybe benchmarking on this laptop is
generally unreliable.
With this breakdown of the operations, the battle is between mullo of
size 2n and (mullo of size n + 2n x n multiply).
Regards,
/Niels
More information about the gmp-devel
mailing list