GMP on the Cell processor
hobold at informatik.uni-bremen.de
Fri Apr 20 14:56:33 CEST 2007
(Some of you may remember me for unfulfilled promises about using AltiVec
to speed up GMP on good ole' PowerPC G4 machines. So take my statements
with a grain of salt.)
On Thu, 19 Apr 2007, Torbjorn Granlund wrote:
> I would not expect it to perform well at all, unfortunately. Reasons:
> * IIRC, the Cell doesn't have a 32x32->64 bit multiply instruction.
> The lack of such an operation is a disaster for the performance.
I have only skimmed the SPU instruction set so far, and believe that it is
a bit weaker for wide integer multiplies than AltiVec. Still, it is a bit
better for carry propagation.
Lacking a nice basic instruction for wide multiplies is not such a big
deal, though, as long as the available multiply instructions do enough
work in one way or another. AltiVec's "multipy sum" instructions are a
fairly good example of this. (Something like vector_permute is an
additional requirement to make such SIMD multiplies viable for bignum
arithmetic, otherwise the shuffling of partial results is a bottleneck.)
> * Making good use of SIMD for the most critical GMP operations is hard.
> (See the mpn/cray and perhaps mpn/powerpc*/vmx for some examples.)
True, SIMD style parallelism is not an ideal fit for bignum arithmetic.
The usual reason is the sequential dependency on carry propagation.
But multiplication has a lot of inherent parallelism, and often accounts
for a large percentage of program runtime, so there is good reason to try
and vectorize it.
> * Using multiple cores for any real GMP peformance benefit is probably
> very difficult.
Well, the Cell has extraordinarily good communication between the various
cores, both in terms of latency and bandwidth. If it is at all feasible to
partition a single huge number and distribute work to several processor
cores, then the Cell is probably better suited to it than any other widely
More information about the gmp-discuss