Cofactor canonicalisation of mpn_gcdext
Torbjorn Granlund
tg at gmplib.org
Thu Dec 3 19:10:23 CET 2009
nisse at lysator.liu.se (Niels Möller) writes:
I gave div1 a try, and it actually seemed to make gcdext slower (./speed
-c -s1 mpn_gcdext on my x86_32 laptop).
32-bit machines typically perform division faster than 64-bit machines.
And in the case of x86, there is a more severe shortage of registers in
the 32-bit instruction set than in the 64-bit instruction set; that will
hurt div1.
> If you think this code is right, don't be shy: please push it into the
> main repo and check http://gmplib.org/devel/testmachines.shtml tomorrow.
Pushed in now.
Thanks!
Benchmarking on shell, I get (for the old gcdext code):
$ ./speed -c -s1-5 mpn_gcdext
overhead 6.06 cycles, precision 10000 units of 4.66e-11 secs, CPU freq 21438.42 MHz
mpn_gcdext
1 2337.48
2 4782.41
3 7249.04
4 9705.38
5 12155.41
and for the just pushed in code, with one /-division per iteration in gcdext_1, I
get
1 1585.38
2 4659.20
3 7091.04
4 9503.12
5 11924.64
So, on shell we get a speedup. Cute.
I bet that we do not get a speedup on repentium or panther or tiger.
Only 45nm Core 2 and Core i have early-out dividers, I think. An
early-out divider will help here.
Small-operand gcd and gcdext are interesting and important things to
optimise. We might define different approaches, and let tuneup choose
the best one.
--
Torbjörn
More information about the gmp-devel
mailing list