SSE2 basecase multiplication

Sat Dec 7 21:32:29 UTC 2013

Vasili Burdo <vasili.burdo at gmail.com> writes:

  I implemented basecase multiplication and squaring for x86 using SSE2
  instructions and the Comba column-wise multiplication method.
  On Ivy Bridge (Intel Core i7 3517U) multiplication is 10-20% faster than
  the present GMP basecase MMX multiplication.
  Squaring is 5-10% faster than the GMP MMX version.
  However, on an older CPU (Core 2 Duo E7500, Wolfdale) the same code is
  15-30% slower than the GMP MMX version.

The present code uses SSE2 instructions, I think.  But perhaps you mean
that we use mmx registers instead of xmm registers?

  What is good about SIMD and Comba is that they match each other perfectly.
  It's easy to do 2 or more multiplications in parallel.
  But, to my surprise, the gain was not so good. I would have expected at
  least 50% over the MMX code. Moreover, 32-bit x86 is nearly obsolete.

I have the same experience: getting any real gain from SIMD operations
on x86 is hard.  The instructions are simply not properly designed.

I am not too fussed about x86-32 performance, but if we can make a
significant speedup, then we should consider putting it in GMP.

It might even be possible to do four 32 x 32 multiplies at a time on
Haswell, using vpmuludq.
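
To make the instruction concrete, here is a minimal sketch of the 128-bit
SSE2 form (pmuludq, reached through the _mm_mul_epu32 intrinsic), which
performs two 32 x 32 -> 64 multiplies at once.  The 256-bit vpmuludq form
would do four in the same way.  The helper name mul_2x32 is made up for
illustration and is not part of GMP; the scalar fallback is there only so
the sketch also builds where SSE2 is not available.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not a GMP function:
   out[0] = a0 * b0 and out[1] = a1 * b1, each as a full 64-bit product. */
void mul_2x32(uint64_t out[2], uint32_t a0, uint32_t a1,
              uint32_t b0, uint32_t b1)
{
#if defined(__SSE2__)
    /* pmuludq multiplies the low 32 bits of each 64-bit lane, so each
       operand is placed in the low half of its lane with zero above it. */
    __m128i a = _mm_set_epi32(0, (int)a1, 0, (int)a0);
    __m128i b = _mm_set_epi32(0, (int)b1, 0, (int)b0);
    __m128i p = _mm_mul_epu32(a, b);
    _mm_storeu_si128((__m128i *)out, p);
#else
    /* Scalar fallback with the same result. */
    out[0] = (uint64_t)a0 * b0;
    out[1] = (uint64_t)a1 * b1;
#endif
}
```

Note that half of each input register is unused by the multiply, which is
part of why arranging the operands tends to eat into the gain.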

Note also that a full vertical multiply might not be the best solution,
but that addmul_(k) for k >= 2 will have many of the same advantages.
An indirect disadvantage of vertical multiply is that trip counts will
change in the inner loop for each outer loop iteration.  That adds
cycles due to mispredictions.
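
The trip count issue is easy to see in a plain scalar sketch of a vertical
(Comba) multiply.  This is not the SSE2 code under discussion, just an
illustration with 32-bit limbs; the name comba_mul_n and the 96-bit
accumulator layout are my own assumptions.  The inner loop runs 1, 2, ...,
n, ..., 2, 1 times as k advances.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not a GMP function.
   rp[0..2n-1] = ap[0..n-1] * bp[0..n-1], limbs least significant first.
   rp must not overlap ap or bp. */
void comba_mul_n(uint32_t *rp, const uint32_t *ap, const uint32_t *bp,
                 size_t n)
{
    assert(n >= 1);
    uint64_t acc = 0;    /* low 64 bits of the running column sum */
    uint32_t acc_hi = 0; /* bits 64 and up of the running column sum */

    for (size_t k = 0; k < 2 * n - 1; k++) {
        /* Column k sums ap[i] * bp[k - i] over all valid i, so the
           trip count of the inner loop depends on k. */
        size_t lo = (k < n) ? 0 : k - n + 1;
        size_t hi = (k < n) ? k : n - 1;
        for (size_t i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)ap[i] * bp[k - i];
            acc += p;
            acc_hi += (acc < p); /* carry out of the low 64 bits */
        }
        rp[k] = (uint32_t)acc;
        /* Shift the 96-bit accumulator right by one limb. */
        acc = (acc >> 32) | ((uint64_t)acc_hi << 32);
        acc_hi = 0;
    }
    rp[2 * n - 1] = (uint32_t)acc;
}
```

Each result limb is written exactly once, which is the attraction of the
method; the price is the varying inner loop length.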

Did you test your code with tests/devel/try?

  I'm going to try the same approach for ARM NEON. The NEON instruction set
  is more elegant than SSE2, so I'm rather optimistic about beating the
  current ARM GMP multiplication...

Please start with the current repository code, not the code from GMP
5.1.  Also, check out last spring's ARM work on gmp-devel, in particular
https://gmplib.org/list-archives/gmp-devel/2013-April/003299.html.

A lot of work has been done, but much more could be done.  The multiply
workhorse for A9 and A15 is addmul_3, but perhaps addmul_2 could be sped
up to make addmul_3 superfluous.  With Neon, I expect a mul_basecase
could reach 1 cycle per 32 x 32 product accumulation on A15.
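
For readers not familiar with the addmul_k primitives, here is a rough
portable C sketch of what an addmul_2 computes, with 32-bit limbs.  The
name sketch_addmul_2 and its exact calling convention (n+2 result limbs
updated in place, carry returned) are assumptions made for this example
and may differ from GMP's internal mpn_addmul_2; the real routines are
hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add x into the 96-bit accumulator hi:lo. */
static void acc_add(uint64_t *lo, uint32_t *hi, uint64_t x)
{
    *lo += x;
    *hi += (*lo < x); /* carry out of the low 64 bits */
}

/* Hypothetical sketch: rp[0..n+1] += up[0..n-1] * vp[0..1].
   Returns the carry out of limb n+1 (0 or 1). */
uint32_t sketch_addmul_2(uint32_t *rp, const uint32_t *up, size_t n,
                         const uint32_t *vp)
{
    assert(n >= 1);
    const uint32_t v0 = vp[0], v1 = vp[1];
    uint64_t lo = 0;
    uint32_t hi = 0;
    uint32_t u_prev = 0;

    /* The trip count is always n, whatever the outer loop is doing. */
    for (size_t i = 0; i < n; i++) {
        acc_add(&lo, &hi, rp[i]);
        acc_add(&lo, &hi, (uint64_t)up[i] * v0);  /* weight i */
        acc_add(&lo, &hi, (uint64_t)u_prev * v1); /* up[i-1]*v1, weight i */
        rp[i] = (uint32_t)lo;
        lo = (lo >> 32) | ((uint64_t)hi << 32);
        hi = 0;
        u_prev = up[i];
    }
    /* Limb n: the last up[n-1] * v1 product plus the pending carry. */
    acc_add(&lo, &hi, rp[n]);
    acc_add(&lo, &hi, (uint64_t)u_prev * v1);
    rp[n] = (uint32_t)lo;
    lo = (lo >> 32) | ((uint64_t)hi << 32);
    /* Limb n+1: only the carry remains, so 64 bits are enough here. */
    lo += rp[n + 1];
    rp[n + 1] = (uint32_t)lo;
    return (uint32_t)(lo >> 32);
}
```

A basecase multiply built on this would call it once per pair of v limbs,
advancing rp and vp by two each time, so every u limb loaded is used for
two products and rp is read and written half as often as with addmul_1.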

See also https://gmplib.org/devel/asm.html for an overview of ARM's (and
lots of other processors') performance.

Torbjörn