GMP on the Cell processor

Fri Apr 20 14:56:33 CEST 2007

(Some of you may remember me for unfulfilled promises about using AltiVec 
to speed up GMP on good ole' PowerPC G4 machines. So take my statements 
with a grain of salt.)

On Thu, 19 Apr 2007, Torbjorn Granlund wrote:

> I would not expect it to perform well at all, unfortunately.  Reasons:
>
> * IIRC, the Cell doesn't have a 32x32->64 bit multiply instruction.
>  The lack of such an operation is a disaster for the performance.
>
I have only skimmed the SPU instruction set so far, and believe that it is 
a bit weaker for wide integer multiplies than AltiVec. Still, it is a bit 
better for carry propagation.

Lacking a nice basic instruction for wide multiplies is not such a big 
deal, though, as long as the available multiply instructions do enough 
work in one way or another. AltiVec's "multipy sum" instructions are a 
fairly good example of this. (Something like vector_permute is an 
additional requirement to make such SIMD multiplies viable for bignum 
arithmetic, otherwise the shuffling of partial results is a bottleneck.)

> * Making good use of SIMD for the most critical GMP operations is hard.
>  (See the mpn/cray and perhaps mpn/powerpc*/vmx for some examples.)
>
True, SIMD style parallelism is not an ideal fit for bignum arithmetic. 
The usual reason is the sequential dependency on carry propagation.

But multiplication has a lot of inherent parallelism, and often accounts 
for a large percentage of program runtime, so there is good reason to try 
and vectorize it.

> * Using multiple cores for any real GMP peformance benefit is probably
>  very difficult.
>
Well, the Cell has extraordinarily good communication between the various 
cores, both in terms of latency and bandwidth. If it is at all feasible to 
partition a single huge number and distribute work to several processor 
cores, then the Cell is probably better suited to it than any other widely
available architecture.

   Holger