GMP and CUDA

Sat Jun 13 00:32:10 CEST 2009

CUDA is mighty interesting, Allan.
Note that for what GMP is doing, ATI/AMD has, on paper, faster and
cheaper hardware.
The CUDA card that is not bandwidth-limited is the Tesla, which is very,
very expensive for what it delivers.

Basically it is a 32-bit platform, no matter the idiotic claims made
for it, and Nvidia gives no description of which hardware instructions
it supports. So supporting anything on Tesla is a full-time job, as you
first have to figure out how and what.

AMD/ATI recently released more information on their platform.

Compare prices too: a $7500-$10k CUDA Tesla platform with 4 GPUs has
960 stream processors at 1.2 GHz, versus a single X2 card from AMD,
costing a few hundred euros, which has 1600 execution units.
You can basically see the latter as a 320-processor card with 5
execution units per processor.

There is of course more cache then on AMD for each 'processor'  
available.
Now, keeping 5 execution units busy is not easy, of course. Finding the
right instruction mix is very complicated, just as it is on a PC.

From organisations programming for both kinds of hardware (the GPU
versions, not the Tesla version), it is well known that with some big
effort they can keep Nvidia's stream processors effectively busy around
20% of the time, and AMD/ATI's about 40%.

Maybe with GMP you can get things to work a bit more efficiently than
that, as you require more overhead for supporting really huge integers.

A big problem with both manufacturers is, of course, limited cache. I
read somewhere that it is complicated to allocate more than 400 MB of
RAM on Nvidia GPUs; only Tesla allows allocating more. That is called
device RAM.

For a simple implementation of an FFT that can multiply big integers,
I've been toying a little on paper with how to keep an Nvidia card
effectively busy doing that. I've been focusing on the multiply-add
capability that Nvidia has there.
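
To illustrate why the multiply-add matters for an FFT: a radix-2
butterfly computes a + w*b and a - w*b on complex numbers, and every
real operation in it can be written as a multiply-add. The sketch below
is Python rather than GPU code, and the names mad and butterfly are
made up for the illustration. It is not the scheme worked out on paper
above, only a way to count the operations.

```python
def mad(a, b, c):
    """Multiply-add a*b + c, standing in for a hardware MAD instruction."""
    return a * b + c


def butterfly(ar, ai, br, bi, wr, wi):
    """Radix-2 butterfly on complex a, b with twiddle w, as 8 multiply-adds.

    Returns (a + w*b, a - w*b), each as a (real, imag) pair.
    """
    # a + w*b: real part is ar + wr*br - wi*bi, imaginary is ai + wr*bi + wi*br
    xr = mad(-wi, bi, mad(wr, br, ar))
    xi = mad(wi, br, mad(wr, bi, ai))
    # a - w*b: the same products with the signs flipped
    yr = mad(wi, bi, mad(-wr, br, ar))
    yi = mad(-wi, br, mad(-wr, bi, ai))
    return (xr, xi), (yr, yi)


# a = 1+2i, b = 3+4i, w = -i
x, y = butterfly(1.0, 2.0, 3.0, 4.0, 0.0, -1.0)
print(x, y)
```

Written this way a butterfly is 8 multiply-adds and no separate
multiplies or adds, which, going by the flop counting further down in
this post, is the instruction mix the hardware rewards.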

Regrettably I lack the funding to buy a good card (and a computer with
a PCI-e 16x slot to put it in) to toy with.

It seems to me that a transform I implemented in integers, which is
similar to the transform GMP nowadays also uses, can be implemented on
Nvidia/AMD in several phases.

The first phase is to write code that can keep a single block of
threads on Nvidia busy within the register files, going to device RAM
only seldom. Effective usage of all cores is very important, because
most of the GPU software that some 'companies' use effectively reaches
an IPC of 20% or so (scroll up). A single Chinese research group
reports 40% on CUDA cards, yet they are so far the only exception
making such a claim for software that matters.

We have to face that GPUs easily scale further and are in practice
simply a lot faster than a CPU.
To compete with the great assembler-level code of GMP requires quite
some work on such a GPU.

It's not really fair to compare a 60 watt CPU with a 300 watt GPU, of
course.

I'm always looking at how much power the whole thing eats.
If you use a 'cheapo' GPU, you must also include the CPU.

Maybe for now the cleverest thing is to do some computationally
intensive calculations on the GPU, stream the results to the CPU, and
do all kinds of weirdo functions from the GMP library on the CPU.

Note: bandwidth from GPU to CPU is very limited, so you really must do
some heavy calculations on the GPU then.

Note 2: some will claim some GPUs are true double precision. This is
right; the problem, however, is that only a FEW of the execution units
are double precision (latest Nvidia cards). These simply do not beat
cheapo quad-cores.

Finding an optimal mix always involves single-precision multiply-add
calculations. Without focusing on single-precision multiply-add you can
throw away the GPU, as that is its only advantage over CPUs. It is 1
instruction for it, whereas it counts as 8 flops (vector of 4).
Without it you have 4 flops per cycle per stream processor, so you can
already start by dividing any claimed performance by 2.
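
The counting above can be written out as a tiny calculation. This is a
sketch using the figures as stated in this post (8 flops per cycle with
a vector-of-4 multiply-add, 4 without, and the roughly 20% utilisation
mentioned earlier for Nvidia). The function name is hypothetical and
the numbers are not measurements.

```python
def flops_per_cycle(vector_width=4, uses_multiply_add=True):
    """Flops one stream processor retires per cycle, by this post's counting."""
    # A multiply-add is counted as 2 flops per vector lane, a plain op as 1.
    return vector_width * (2 if uses_multiply_add else 1)


with_mad = flops_per_cycle(uses_multiply_add=True)
without_mad = flops_per_cycle(uses_multiply_add=False)

# Factor by which a claimed peak overstates code that cannot use multiply-add.
claimed_over_real = with_mad // without_mad

# Effective rate at the utilisation quoted earlier in the post.
utilisation_percent = 20
effective = with_mad * utilisation_percent / 100

print(with_mad, without_mad, claimed_over_real, effective)
```

So code that cannot use the multiply-add starts at half the advertised
peak before utilisation is even taken into account.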

How about that?

CPUs are way more efficient there.

On Jun 8, 2009, at 9:33 AM, Allan Menezes wrote:

> Hi,
>   Is it easy or possible to port GMP version 4.3.1 to the CUDA  
> platform?
> I notice that GMP uses assembly level coding for some of it lower  
> routines in the subdirectory ../mpn
> Also will licensing of the CUDA software which is available for  
> linux too be a problem with the terms of LGPL?
> Thank you,
> Allan MeneZes
> _______________________________________________
> gmp-discuss mailing list
> gmp-discuss at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-discuss
>