Fallback code for two-word multiplication

Mon Jan 2 23:19:04 UTC 2017

Hi,

This is a sort of follow-up for last month discussion about two-word multiplication.

I have been trying another implementation, which has no check for overflow.
Most often, it seems to be significantly faster.
My code is available here: http://94.23.21.190/publicshare/mul64x2.tar.gz

The Makefile is provided, just do make.
Execute with these commands:
./mul64x2 bench-umul 10000000
./mul64x2 bench-smul 10000000

There are several test implementations in my code.
There are two points I wanted to investigate:
- the impact of my no-overflow unsigned multiplication version
- the impact of my sign extension version: use signed shift instead of unsigned shift
  this saves two subtractions

The different multiplication versions are:
umul:
- naive .... (forget that)
- nov ...... my no-overflow
- nov2 ..... my no-overflow, but written with all one-word multiplications first, like done in GMP
- gmp ...... GMP implementation, for reference
smul:
- naive .... (forget that)
- nov ...... my no-overflow
- nov2 ..... my no-overflow, but written with all one-word multiplications first, like done in GMP
- novgmp ... (not interesting) nov version, but the sign extension uses unsigned shift, like done in GMP
- nov2gmp .. (not interesting) nov2 version, but the sign extension uses if(a<b) c++
- gmp ...... GMP implementation, for reference
- gmp2 ..... GMP implementation, but the sign extension uses my signed shift
- gmp3 ..... GMP implementation, but the sign extension uses if(a<b) c++

I tested on three different machines:
- Core2 Duo P7350, laptop
- i7-4790
- Cortex-A9, armv7l, board Odroid-X2

Here are the execution time stats for the different multiplication versions, compared with gmp version:
(negative values mean faster than GMP)

Processor Core2 Duo P7350, laptop
  umul: nov ... +0.8%
        nov2 .. +0.72%
  smul: nov ... -5%
        nov2 .. -11%
        gmp2 .. -0.46%
        gmp3 .. +14.9%

Processor i7-4790
  umul: nov ... -7.6%
        nov2 .. -7.6%
  smul: nov ... -7.2%
        nov2 .. -7.2%
        gmp2 .. +6.7%
        gmp3 .. -6.2%

Processor Cortex-A9, armv7l, board Odroid-X2
  umul: nov ... -1.75%
        nov2 .. -12.3%
  smul: nov ... -1.93%
        nov2 .. -0.74%
        gmp2 .. -3.02%
        gmp3 .. -4.96%

Globally, my versions nov and nov2 are faster.
Only on my Core2 Duo processor it is slower but not by much.

But... I don't understand some of the results.

I don't understand why gmp2 is slower on i7... it should be faster like on my laptop.

I don't understand why the perf of nov and nov2 are different on Cortex-A9.
Basically, it's just some lines of code that are swapped in the code.
Why would the compiler scheduler output code with that much different perf?

Also I don't understand why there is such difference with nov2, for smul and umul...

Finally, when compiled, my nov and nov2 versions use a bit more processor instructions than the gmp version.
So I don't know what would be the perf on other machines.
That would depend on whether there is a conditional move/add instruction, etc

So, for now I'd just like some comments about my nov or nov2 versions.

And happy new year!
Adrien