SSE2 basecase multiplication
Vasili Burdo
vasili.burdo at gmail.com
Thu Dec 5 20:45:47 UTC 2013
Hi.
I implemented basecase multiplication and squaring for x86 using SSE2
instructions and Comba column-wise multiplication method.
On Ivy Bridge (Intel Core i7 3517U) multiplication 10-20% faster than
present GMP basecase MMX multiplication.
Squaring is 5-10% faster than GMP MMX version.
However, on older CPU (Core 2 DUO E7500, Wolfdale) the same code is 15-30
worse than GMP MMX version.
What is good about SIMD and Comba - they perfectly match each other. It's
easy to do 2 or more multiplications in parallel.
But, to my surprize, the gain was not so good. I would expect at least 50%
over MMX code. Moreover 32-bit x86 is nearly obsolete.
I'm going to try the same approach for ARM NEON. NEON instruction set is
more elegant, than SSE2, so I'm rather optimistic
to beat current ARM GMP multiplication...
If anybody interested, I will share my present (GNU assembly and C/C++ SSE
intrinsic) implementation of basecase MUL and SQR.
Here are results (Windows, MINGW32):
~~~~~~~~~~~~~~~~~
/d/prj/3rdparty/gmp-32/tune$> ./speed.exe -Cr -s2-64 mpn_mul_basecase
mpn_mul_basecase_vb
overhead 5.29 cycles, precision 10000 units of 3.58e-010 secs, CPU freq
2794.00 MHz
mpn_mul_basecase mpn_mul_basecase_vb
2 13.2628 #0.7281
3 12.8256 #0.8463
4 14.7880 #0.8549
5 15.7579 #0.9141
6 16.9703 #0.9563
7 18.8107 #0.9472
8 20.7070 #0.9374
9 22.4533 #0.9253
10 24.2333 #0.9179
11 26.1746 #0.9048
12 28.0606 #0.8986
13 30.0165 #0.8832
14 32.2114 #0.8684
15 33.6000 #0.8764
16 35.6812 #0.8637
17 37.6875 #0.8777
18 39.8452 #0.8755
19 41.4899 #0.8687
20 44.2875 #0.8339
21 46.4935 #0.8316
22 48.4636 #0.8322
23 50.3333 #0.8283
24 52.2222 #0.8210
25 53.5350 #0.8298
26 55.8297 #0.8223
27 57.7619 #0.8173
28 59.5867 #0.8148
29 62.5862 #0.8038
30 63.4333 #0.8126
31 64.8548 #0.8156
32 67.5938 #0.8037
/d/prj/3rdparty/gmp-32/tune$> ./speed.exe -Cr -s2-32 mpn_sqr_basecase
mpn_sqr_basecase_vb
overhead 5.29 cycles, precision 10000 units of 3.58e-010 secs, CPU freq
2794.00 MHz
mpn_sqr_basecase mpn_sqr_basecase_vb
2 11.0504 #0.7342
3 12.0921 #0.6959
4 12.2253 #0.7774
5 13.4742 #0.8283
6 14.0847 #0.8827
7 14.8528 #0.9195
8 15.6207 #0.9451
9 16.5070 #0.9704
10 17.5082 #0.9794
11 18.3551 #0.9722
12 19.5435 #0.9555
13 20.4577 #0.9553
14 21.4923 #0.9464
15 22.4375 #0.9387
16 23.4310 #0.9345
17 24.3137 #0.9337
18 25.4167 #0.9304
19 26.3612 #0.9284
20 27.3075 #0.9253
21 28.2782 #0.9174
22 29.6847 #0.8977
23 30.2853 #0.9642
24 31.2333 #0.9028
25 32.2523 #0.8975
26 33.8609 #0.8748
27 34.6944 #0.8780
28 35.8929 #0.8622
29 36.1448 #0.8828
30 36.9900 #0.8757
31 38.1935 #0.8705
32 39.1875 #0.8672
Some implementation details:
- only SSE2 instructions are used.
- Column multiplication is unrolled to 4 multiplications per loop.
- All data loads are done using unaligned instruction "muldqu" to load 4
limbs at time.
- Computation of 4 first and last columns is unrolled. And they share input
data (4 limbs).
The main unit of computation does 2 multiplications at time.
"4 multiplications" loop unit is just a 2 interleaved instances
of the code below:
---------------------------------------
pshufd $0xXX,SRC_A,VA0 # [?,a1,?,a0]
pshufd $0xXX,SRC_B,VB0 # [?,b1,?,b0]
pmuludq VB0,VA0 # [(r11,r10)=a1*b1,(r01,r00)=a0*b0]
movdqa VA0,VB0
punpckhdq ZERO,VA0 # [0,r11,0,r10]
punpckldq ZERO,VB0 # [0,r01,0,r00]
paddq VA0,ACC # [(p11,p10),(p01,p00)] += [(0,r11),(0,r10)]
paddq VB0,ACC # [(p11,p10),(p01,p00)] += [(0,r01),(0,r00)]
---------------------------------------
Column product save & shift looks like this:
---------------------------------------
movd ACC,(memory) # [p11,p10,p01,p00]; p00 ==> memory
movq ACC,SCRATCH # [0,0,p01,p00]
psrldq $8,ACC # [0,0,p11,p10]
psrldq $4,SCRATCH # [0,0,0,p01]
paddq SCRATCH,VACC # [(0,0),(p11,p10)] += [(0,0),(0,p01)]
---------------------------------------
Thanks,
Vasili
More information about the gmp-devel
mailing list