Fallback code for two-word multiplication
Adrien Prost-Boucle
adrien.prost-boucle at laposte.net
Mon Jan 2 23:19:04 UTC 2017
Hi,
This is a sort of follow-up for last month discussion about two-word multiplication.
I have been trying another implementation, which has no check for overflow.
Most often, it seems to be significantly faster.
My code is available here: http://94.23.21.190/publicshare/mul64x2.tar.gz
The Makefile is provided, just do make.
Execute with these commands:
./mul64x2 bench-umul 10000000
./mul64x2 bench-smul 10000000
There are several test implementations in my code.
There are two points I wanted to investigate:
- the impact of my no-overflow unsigned multiplication version
- the impact of my sign extension version: use signed shift instead of unsigned shift
this saves two subtractions
The different multiplication versions are:
umul:
- naive .... (forget that)
- nov ...... my no-overflow
- nov2 ..... my no-overflow, but written with all one-word multiplications first, like done in GMP
- gmp ...... GMP implementation, for reference
smul:
- naive .... (forget that)
- nov ...... my no-overflow
- nov2 ..... my no-overflow, but written with all one-word multiplications first, like done in GMP
- novgmp ... (not interesting) nov version, but the sign extension uses unsigned shift, like done in GMP
- nov2gmp .. (not interesting) nov2 version, but the sign extension uses if(a<b) c++
- gmp ...... GMP implementation, for reference
- gmp2 ..... GMP implementation, but the sign extension uses my signed shift
- gmp3 ..... GMP implementation, but the sign extension uses if(a<b) c++
I tested on three different machines:
- Core2 Duo P7350, laptop
- i7-4790
- Cortex-A9, armv7l, board Odroid-X2
Here are the execution time stats for the different multiplication versions, compared with gmp version:
(negative values mean faster than GMP)
Processor Core2 Duo P7350, laptop
umul: nov ... +0.8%
nov2 .. +0.72%
smul: nov ... -5%
nov2 .. -11%
gmp2 .. -0.46%
gmp3 .. +14.9%
Processor i7-4790
umul: nov ... -7.6%
nov2 .. -7.6%
smul: nov ... -7.2%
nov2 .. -7.2%
gmp2 .. +6.7%
gmp3 .. -6.2%
Processor Cortex-A9, armv7l, board Odroid-X2
umul: nov ... -1.75%
nov2 .. -12.3%
smul: nov ... -1.93%
nov2 .. -0.74%
gmp2 .. -3.02%
gmp3 .. -4.96%
Globally, my versions nov and nov2 are faster.
Only on my Core2 Duo processor it is slower but not by much.
But... I don't understand some of the results.
I don't understand why gmp2 is slower on i7... it should be faster like on my laptop.
I don't understand why the perf of nov and nov2 are different on Cortex-A9.
Basically, it's just some lines of code that are swapped in the code.
Why would the compiler scheduler output code with that much different perf?
Also I don't understand why there is such difference with nov2, for smul and umul...
Finally, when compiled, my nov and nov2 versions use a bit more processor instructions than the gmp version.
So I don't know what would be the perf on other machines.
That would depend on whether there is a conditional move/add instruction, etc
So, for now I'd just like some comments about my nov or nov2 versions.
And happy new year!
Adrien
More information about the gmp-devel
mailing list