SSE2 basecase multiplication

Thu Dec 5 20:45:47 UTC 2013

Hi.

I implemented basecase multiplication and squaring for x86 using SSE2
instructions and Comba column-wise multiplication method.
On Ivy Bridge (Intel Core i7 3517U) multiplication 10-20% faster than
present GMP basecase MMX multiplication.
Squaring is 5-10% faster than GMP MMX version.
However, on older CPU (Core 2 DUO E7500, Wolfdale) the same code is 15-30
worse than GMP MMX version.

What is good about SIMD and Comba - they perfectly match each other. It's
easy to do 2 or more multiplications in parallel.
But, to my surprize, the gain was not so good. I would expect at least 50%
over MMX code. Moreover 32-bit x86 is nearly obsolete.
I'm going to try the same approach for ARM NEON. NEON instruction set is
more elegant, than SSE2, so I'm rather optimistic
to beat current ARM GMP multiplication...

If anybody interested, I will share my present (GNU assembly and C/C++ SSE
intrinsic) implementation of basecase MUL and SQR.

Here are results (Windows, MINGW32):
~~~~~~~~~~~~~~~~~

/d/prj/3rdparty/gmp-32/tune$> ./speed.exe -Cr -s2-64 mpn_mul_basecase
mpn_mul_basecase_vb
overhead 5.29 cycles, precision 10000 units of 3.58e-010 secs, CPU freq
2794.00 MHz
        mpn_mul_basecase mpn_mul_basecase_vb
2             13.2628       #0.7281
3             12.8256       #0.8463
4             14.7880       #0.8549
5             15.7579       #0.9141
6             16.9703       #0.9563
7             18.8107       #0.9472
8             20.7070       #0.9374
9             22.4533       #0.9253
10            24.2333       #0.9179
11            26.1746       #0.9048
12            28.0606       #0.8986
13            30.0165       #0.8832
14            32.2114       #0.8684
15            33.6000       #0.8764
16            35.6812       #0.8637
17            37.6875       #0.8777
18            39.8452       #0.8755
19            41.4899       #0.8687
20            44.2875       #0.8339
21            46.4935       #0.8316
22            48.4636       #0.8322
23            50.3333       #0.8283
24            52.2222       #0.8210
25            53.5350       #0.8298
26            55.8297       #0.8223
27            57.7619       #0.8173
28            59.5867       #0.8148
29            62.5862       #0.8038
30            63.4333       #0.8126
31            64.8548       #0.8156
32            67.5938       #0.8037

/d/prj/3rdparty/gmp-32/tune$> ./speed.exe -Cr -s2-32 mpn_sqr_basecase
mpn_sqr_basecase_vb
overhead 5.29 cycles, precision 10000 units of 3.58e-010 secs, CPU freq
2794.00 MHz
        mpn_sqr_basecase mpn_sqr_basecase_vb
2             11.0504       #0.7342
3             12.0921       #0.6959
4             12.2253       #0.7774
5             13.4742       #0.8283
6             14.0847       #0.8827
7             14.8528       #0.9195
8             15.6207       #0.9451
9             16.5070       #0.9704
10            17.5082       #0.9794
11            18.3551       #0.9722
12            19.5435       #0.9555
13            20.4577       #0.9553
14            21.4923       #0.9464
15            22.4375       #0.9387
16            23.4310       #0.9345
17            24.3137       #0.9337
18            25.4167       #0.9304
19            26.3612       #0.9284
20            27.3075       #0.9253
21            28.2782       #0.9174
22            29.6847       #0.8977
23            30.2853       #0.9642
24            31.2333       #0.9028
25            32.2523       #0.8975
26            33.8609       #0.8748
27            34.6944       #0.8780
28            35.8929       #0.8622
29            36.1448       #0.8828
30            36.9900       #0.8757
31            38.1935       #0.8705
32            39.1875       #0.8672

Some implementation details:
- only SSE2 instructions are used.
- Column multiplication is unrolled to 4 multiplications per loop.
- All data loads are done using unaligned instruction "muldqu" to load 4
limbs at time.
- Computation of 4 first and last columns is unrolled. And they share input
data (4 limbs).

The main unit of computation does 2 multiplications at time.
"4 multiplications" loop unit is just a 2 interleaved instances
of the code below:
---------------------------------------
    pshufd      $0xXX,SRC_A,VA0 # [?,a1,?,a0]
    pshufd      $0xXX,SRC_B,VB0 # [?,b1,?,b0]
    pmuludq     VB0,VA0         # [(r11,r10)=a1*b1,(r01,r00)=a0*b0]
    movdqa      VA0,VB0
    punpckhdq   ZERO,VA0        # [0,r11,0,r10]
    punpckldq   ZERO,VB0        # [0,r01,0,r00]
    paddq       VA0,ACC         # [(p11,p10),(p01,p00)] += [(0,r11),(0,r10)]
    paddq       VB0,ACC         # [(p11,p10),(p01,p00)] += [(0,r01),(0,r00)]
---------------------------------------

Column product save & shift looks like this:
---------------------------------------
    movd        ACC,(memory)   # [p11,p10,p01,p00]; p00 ==> memory
    movq        ACC,SCRATCH    # [0,0,p01,p00]
    psrldq      $8,ACC         # [0,0,p11,p10]
    psrldq      $4,SCRATCH     # [0,0,0,p01]
    paddq       SCRATCH,VACC   # [(0,0),(p11,p10)] += [(0,0),(0,p01)]
---------------------------------------

Thanks,
Vasili