GPGPU support

Tue Oct 12 01:37:54 CEST 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Rick C. Hodgin wrote:
>> Hi,
>> I'm not really sure which kind of computations gmp uses, but in my
>> experience matrix-computations and ffts can be done really well on
>> gpus. This is why cuda is used in most projects. It provides cuBLAS
>> and cuFFT, so that you don't have to care about the low-level cuda
>> implementation and can just use the cuda-kernels in your normal
>> c++-code.
>> For example it is relatively straight forward to change the
>> lapack-code to run on a nvidia-gpu.
>> Yours sincerely
>> Robert Schade
> 
> Robert, thanks for that info.  I'm thinking there has to be a way to
> approach the vector-like abilities of any GPU engine through a generic
> modification to the compiler back-end.  The logic rules are there for a
> single operation, and if we can create a model which uses those rules in
> a vector instead of a single computing stream, then every application
> out there that makes use of computational logic could be immediately
> migrated to the GPU using this approach.
> 
> gvcc is my proposed naming convention.  It uses the GNU C front-end, and
> the vector component mimics using GPU primitives the same capabilities
> as the CPU would, but always with vector lanes in mind, keeping each one
> where it needs to be throughout the entire function.
> 
> I think this would be possible.  Difficult, but possible, and it would
> solve virtually every issue I've seen to date about serial algorithms
> being unable to reap any benefits from the GPU.  From a programmer's
> point of view, it would be something like having the compiler create
> multiple threads for you, but instead of running on CPUs governed by the
> kernel, they're running on the GPU governed by the specifics of the
> back-end engine used at the time of compile.
> 
> I have a whole slew of steps in mind on how we can get from "does not
> exist today" to "it appears to be working on these logic ops" to "it
> appears to be working on these processing ops" to full-blown
> applications.
> 
> gvcc -- un-serializing the un-serializable. :-)
> 
> - Rick
> 

Hi List readers,

I feel this is more a discussion about crafting a parallel compiler than
about porting a C library like gmp to a different hardware architecture,
like CUDA. After seeing a demonstration of the LLVM compiler
infrastructure:

http://en.wikipedia.org/wiki/Llvm

there are two approaches:

1)  complaining to GNU about GCC

2)  hand-coding in assembly to get around bugs.

Both are tedious.

On http://llvm.org/ I searched for CUDA and noticed some research and
articles on the topic. I feel that as we enter multi-core computing we
are approaching true parallelism; we will then be bound by Amdahl's law
and by Gustafson's remarks on Amdahl's law. We will also leave the von
Neumann architecture (only one core, and no more[TM]), and then I start
to feel I am on very thin ice.
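
To make the two laws mentioned above concrete, here is a minimal sketch
in Python. The function names and the 90 % parallel fraction are only
assumptions for illustration, not measurements of gmp. Amdahl's law
keeps the problem size fixed; Gustafson's law lets the parallel part
grow with the number of cores.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of a fixed workload parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Scaled speedup when the parallel fraction p grows with n cores."""
    return (1.0 - p) + p * n


# Assumed parallel fraction, purely illustrative.
PARALLEL_FRACTION = 0.9

for cores in (1, 2, 4, 6):
    print(cores,
          round(amdahl_speedup(PARALLEL_FRACTION, cores), 2),
          round(gustafson_speedup(PARALLEL_FRACTION, cores), 2))
```

With a small parallel fraction the Amdahl figure stays close to 1 no
matter how many cores are added, which is the case where extra cores
would not help.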

Any C compiler carries the lion's share of the von Neumann architecture,
which assumes only one core. But true parallelism may need a different
approach; maybe a different compiler and a different language? I feel
gmp is excellent the way it is, and I'd like to know more details about
this table:

http://gmplib.org/pi-with-gmp.html

CPU                      Clock
AMD Athlon (K8)          2.2 GHz
AMD Phenom II (K10)      3.2 GHz
Intel Pentium 4          3.4 GHz
Intel Core 2             2.13 GHz
Intel Core i7            2.67 GHz
PowerPC 970              1.6 GHz
Itanium 2                0.9 GHz

How many cores were used? Some Intel Core i7 models have six cores. With
the AMD Phenom II (K10) I tried four cores and got nearly the same
results in my first attempts at 100 000 digits. Will an increase or
decrease in the number of cores actually involved give any speedup close
to what Amdahl's law predicts?

Parallel computing makes sense to me in that an image rendering can be
subdivided into squares, which can be rendered simultaneously. Even
porting gmp to Microsoft Windows and compiling it with Visual Studio is
not trivial.
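
The image-rendering case can be sketched as follows. This is only an
illustration: shade() is a hypothetical stand-in for real per-pixel
work, and the image and tile sizes are made up. Each square is computed
independently and then copied into its place in the final image.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, TILE = 64, 48, 16   # assumed sizes, for illustration only


def shade(x, y):
    """Hypothetical per-pixel function standing in for real rendering work."""
    return (x * 31 + y * 17) % 256


def render_tile(origin):
    """Render one square; it depends on no other square."""
    x0, y0 = origin
    rows = [[shade(x, y) for x in range(x0, min(x0 + TILE, WIDTH))]
            for y in range(y0, min(y0 + TILE, HEIGHT))]
    return origin, rows


def render(workers=4):
    """Split the image into squares, render them in a pool, reassemble."""
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    origins = [(x, y)
               for y in range(0, HEIGHT, TILE)
               for x in range(0, WIDTH, TILE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for (x0, y0), rows in pool.map(render_tile, origins):
            for dy, row in enumerate(rows):
                image[y0 + dy][x0:x0 + len(row)] = row
    return image
```

Threads keep the sketch short; for pure-Python arithmetic in CPython a
ProcessPoolExecutor would be needed to actually occupy several cores.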

Porting it to CUDA or LLVM may not be such a good idea, compared with
casting it into hardware as a core CPU for number crunching. Multi-core
computing and architecture can also be asymmetric, meaning that some
cores do a different task while the rest are idle, which saves
electricity. There are some AES hardware extensions in some recent Intel
CPUs, IIRC.

On my old machine I get a speed approximately 60 times slower than the
table says, with a single-core Pentium II at 366 MHz and 256 MB RAM, on
32-bit x86, but only for results up to 10^7 digits. 10^8 digits takes
more than 6-8 hours.

My question is: will many cores or parallel computing help with
gmp-chudnovsky? Just for comparison? Why is the number of cores not
reported? I assume all results were 64-bit, as going from 32 bits to
64 bits should give a speedup of about a factor of four.
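
As a rough way to see where several cores could help, here is a small
Python sketch of the Chudnovsky series evaluated by binary splitting,
using Python integers instead of gmp. It is not the gmp-chudnovsky code
itself, and the names bs(), merge() and pi_digits() are my own. The
point is that the two halves of each range are computed independently
and only combined at the end, so they could in principle run on
different cores.

```python
from math import isqrt

C3_OVER_24 = 640320 ** 3 // 24


def merge(left, right):
    """Combine the (P, Q, T) results of two adjacent term ranges."""
    p1, q1, t1 = left
    p2, q2, t2 = right
    return p1 * p2, q1 * q2, q2 * t1 + p1 * t2


def bs(a, b):
    """Binary splitting over series terms a <= k < b; returns (P, Q, T)."""
    if b - a == 1:
        if a == 0:
            p = q = 1
        else:
            p = (6 * a - 5) * (2 * a - 1) * (6 * a - 1)
            q = a * a * a * C3_OVER_24
        t = p * (13591409 + 545140134 * a)
        return p, q, -t if a & 1 else t
    m = (a + b) // 2
    # The two calls below share no data, so they are candidates for
    # running in parallel; here they simply run one after the other.
    return merge(bs(a, m), bs(m, b))


def pi_digits(digits):
    """Return floor(pi * 10**digits) as an integer."""
    terms = digits // 14 + 1          # each term adds about 14 digits
    _, q, t = bs(0, terms)
    sqrt_c = isqrt(10005 * 10 ** (2 * digits))
    return (q * 426880 * sqrt_c) // t


print(pi_digits(50))
```

The final square root and division are not split this way, so by
Amdahl's law they would limit the overall speedup.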

If so (if multi-core speeds things up), I think we should go for it. If
not, if the same results are achieved with one core as with many cores,
I'll look for a different approach.

Sincerely yours,

藻留天   具留部覧度船
Morten Gulbrandsen
_____________________________________________________________________
Java programmer, C++ programmer
CAcert Assurer, GSWoT introducer, thawte Notary
Gossamer Spider Web of Trust http://www.gswot.org
Please consider the environment before printing this e-mail!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: For keyID and its URL see the OpenPGP message header

iEYEAREKAAYFAkyzn9EACgkQ9ymv2YGAKVRwCwCfeU2YSvyVb3GDE/XEEELriRJF
GQYAnRGoyPVrY/3XGHPzWP1K1V6vVKII
=ndhV
-----END PGP SIGNATURE-----