GMP and CUDA
Vincent Diepeveen
diep at xs4all.nl
Sat Jun 13 00:32:10 CEST 2009
CUDA is mighty interesting, Allan,

note that for what GMP is doing, ATI/AMD has, on paper, faster and cheaper hardware.

The bandwidth-unlimited CUDA card is the Tesla, which is very, very expensive for what it delivers.
Basically it is a 32-bit platform, no matter the idiotic claims to the contrary, and there is no description from Nvidia of which hardware instructions they support. So supporting anything on Tesla is a total full-time job, as you first have to figure out how and what.
AMD/ATI recently released more information on their platform.
Compare the prices too: a $7500-$10k Tesla CUDA platform with 4 GPUs has 960 stream processors at 1.2 GHz, versus a single X2 card from AMD, which costs a few hundred euros and has 1600 execution units. You can basically see the latter as a 320-processor card with 5 execution units each. Nvidia does, of course, have more cache available per 'processor' than AMD.
Now, keeping 5 execution units busy is not easy, of course. Finding the right instruction mix is very complicated, just like it is on a PC.
From organisations that program for both kinds of hardware (the GPU versions, not the Tesla version), it is well known that with some big effort they can keep Nvidia's stream processors effectively busy around 20% of the time, and AMD/ATI's around 40%.
Maybe with GMP you can get things to work a bit more effectively than that, as you need more overhead to support really huge integers.
A big problem with both manufacturers is of course the limited cache. I read somewhere that it is complicated to allocate more than 400 MB of RAM on Nvidia GPUs; only Tesla allows allocating more. That is called device RAM.
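As a side note, you can at least probe how much device RAM is actually available before attempting a big allocation. A minimal sketch with the CUDA runtime API (the 400 MB figure below is just the number mentioned above, not a limit I have verified myself):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("device RAM: %zu MB free of %zu MB\n",
           free_bytes >> 20, total_bytes >> 20);

    /* try the ~400 MB allocation mentioned above */
    void *buf = NULL;
    size_t want = (size_t)400 << 20;
    if (cudaMalloc(&buf, want) != cudaSuccess)
        printf("allocating %zu MB of device RAM failed\n", want >> 20);
    else
        cudaFree(buf);
    return 0;
}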
For a simple implementation of an FFT that can do multiplication of big integers, I've been toying a little on paper with how to keep an Nvidia card effectively busy doing that. I've been focussing on the multiply-add capability that Nvidia has there.
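For what it's worth, here is a minimal sketch of the kind of single-precision multiply-add inner step I have in mind, written as a CUDA kernel. The data layout, the kernel name and the plain complex multiply-accumulate are my own assumptions for illustration, not a worked-out FFT:

/* one complex multiply-accumulate per thread, expressed as chained FMAs */
__global__ void cmul_acc(const float2 *a, const float2 *b, float2 *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float2 x = a[i], y = b[i], s = acc[i];

    /* s += x * y (complex); each line is one multiply-add */
    s.x = fmaf(x.x, y.x, fmaf(-x.y, y.y, s.x));
    s.y = fmaf(x.x, y.y, fmaf( x.y, y.x, s.y));

    acc[i] = s;
}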
Regrettably I lack the funding to buy a good card (and a computer with a PCI-e 16x slot to put it in) to toy with.
It seems to me that a transform I implemented in integers, which is similar to the transform GMP also uses nowadays, can be implemented in several phases on Nvidia/AMD.
The first phase is to write code that can keep a single block of threads on Nvidia busy within the register files and only seldom go to device RAM. Effective usage of all cores is very important, because most of the GPU software that some 'companies' use effectively gets an IPC of 20% or so out of them (scroll up). A single Chinese research group reports 40% on CUDA cards, yet so far they are the only exception making such a claim for software that matters.
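To make that first phase concrete, a toy CUDA kernel along these lines. The tile size and the placeholder arithmetic are assumptions; the point is only that each block touches device RAM once on the way in and once on the way out, and otherwise stays in shared memory and registers:

#define TILE 256

__global__ void tile_kernel(float *data, int iters)
{
    __shared__ float tile[TILE];

    int g = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = data[g];              /* one read from device RAM */
    __syncthreads();

    float v = tile[threadIdx.x];
    for (int k = 0; k < iters; ++k) {         /* stand-in for the in-block passes */
        v = fmaf(v, 1.000001f, tile[(threadIdx.x + 1) % TILE]);
        __syncthreads();
        tile[threadIdx.x] = v;
        __syncthreads();
    }

    data[g] = v;                              /* one write back to device RAM */
}

/* launch as, e.g.: tile_kernel<<<n / TILE, TILE>>>(d_data, iters); */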
We have to face that GPUs scale easily further and are simply, in practice, a lot faster than a CPU.
Competing with GMP's great assembler level requires quite some work on such a GPU.
It's not really fair to compare a 60-watt CPU with a 300-watt GPU, of course. I always look at how much power the whole thing eats. If you use a 'cheapo' GPU, you also must include the CPU.
Maybe for now the most clever thing is to do some computationally intensive calculations on the GPU, stream the results to the CPU, and do all kinds of weirdo functions from the GMP library on the CPU. Note that the bandwidth from GPU to CPU is very small, so you really must do some heavy calculation on the GPU then.
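A sketch of what that split could look like on the host side: run a kernel for the expensive part, copy the comparatively small result back over PCI-e, and hand it to GMP on the CPU with mpz_import. The kernel here is a trivial stand-in, and the 32-bit limb layout is an assumption:

#include <stdlib.h>
#include <gmp.h>
#include <cuda_runtime.h>

__global__ void heavy_kernel(unsigned int *limbs, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        limbs[i] = i + 1;                     /* stand-in for the real number crunching */
}

void gpu_then_gmp(int n)
{
    unsigned int *d_limbs, *h_limbs;
    h_limbs = (unsigned int *) malloc(n * sizeof *h_limbs);
    cudaMalloc((void **) &d_limbs, n * sizeof *d_limbs);

    heavy_kernel<<<(n + 255) / 256, 256>>>(d_limbs, n);     /* heavy work on the GPU */
    cudaMemcpy(h_limbs, d_limbs, n * sizeof *h_limbs,
               cudaMemcpyDeviceToHost);                     /* small transfer back over PCI-e */

    mpz_t z;
    mpz_init(z);
    mpz_import(z, n, -1, sizeof *h_limbs, 0, 0, h_limbs);   /* least significant word first */
    /* ... the weirdo mpz_* functions run here on the CPU ... */
    mpz_clear(z);

    cudaFree(d_limbs);
    free(h_limbs);
}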
Note 2: some will claim that some GPUs are true double precision. This is right; the problem, however, is that they have only a FEW execution units that are double precision (the latest Nvidia cards). These simply do not beat cheapo quad-cores.
Finding an optimal mix always involves single-precision multiply-add calculations. Without focussing on single-precision multiply-add you can throw the GPU away, as that is its only advantage over CPUs. It is 1 instruction for the GPU, whereas it counts as 8 flops (a vector of 4). Without it you have 4 flops per cycle per stream processor, so you can already start dividing any claimed performance by 2. How about that? CPUs are way more efficient there.
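To spell out that counting: a multiply-add on a vector of 4 is 4 lanes times 2 flops (one multiply plus one add), so 8 flops; without the multiply-add the same vector costs you 4 flops of work per instruction. In CUDA C you would write it per lane, as in this throwaway kernel (the names are mine, just to make the counting visible):

__global__ void madd4(const float4 *a, const float4 *b, float4 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 x = a[i], y = b[i], r = c[i];
    r.x = fmaf(x.x, y.x, r.x);    /* 2 flops */
    r.y = fmaf(x.y, y.y, r.y);    /* 2 flops */
    r.z = fmaf(x.z, y.z, r.z);    /* 2 flops */
    r.w = fmaf(x.w, y.w, r.w);    /* 2 flops -> 8 flops for one vector multiply-add */
    c[i] = r;
}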
On Jun 8, 2009, at 9:33 AM, Allan Menezes wrote:
> Hi,
> Is it easy or possible to port GMP version 4.3.1 to the CUDA
> platform?
> I notice that GMP uses assembly level coding for some of its lower
> routines in the subdirectory ../mpn
> Also will licensing of the CUDA software which is available for
> linux too be a problem with the terms of LGPL?
> Thank you,
> Allan MeneZes
> _______________________________________________
> gmp-discuss mailing list
> gmp-discuss at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-discuss
>