GPGPU support

Mon Oct 11 11:43:07 CEST 2010

> > I have an idea to approach SIMD differently for this kind of parallel
> > processing (using algorithms which are not inherently geared toward a
> > native port to SIMD).  Is there someone on this list who would be
> > willing to discuss this possibility with me?

> Of course i'm very interested in all this.
> Parallellization using vector cores (not so much SIMD) is very complicated.
> 
> The dominant hardwqare is AMD currently. It delivers 4.6+ or more tflop
> (single precision) versus nvidia 1+ Tflop.
> 
> Most projects are CUDA, for unknown reason to me.

CUDA probably because they have higher-level library support.  AMD's CTM
is a little low-level for most.

My thoughts on this idea are these, and forgive me because I have not
written these down formally, but have them off the top of my head:

Basically, create a back-end compiler that explicitly knows about CPU +
GPU implementation for the target (such as x86-64 / AMD), one that uses
a combination of CPU + GPU synchronization to conduct the vector
computing based on the regular C code rules for a single operation, but
applied to the logic necessary over multiple vector lanes.

Normally, the C code would be compiled into something the CPU would
serialize into the binary, at runtime it would execute, testing the
logic, performing the operations, etc.

In this new back-end, it would take the front-end parsed set of
instructions on what to execute, and how to conduct logic, and instead
of generating the output to conduct that logic to a single serial
instruction stream which would execute on the CPU, it would instead
generate the logic equivalent of what's being required as is necessary
for vector computing, meaning all logic at each stage interacts with the
CPU to determine what operations go in to each lane on successive
iterations based on whatever the natural numbers being computed
indicate.

This results in the original application / algorithm being written
entirely in C (at first, C++ possibly later), but with the actual
processing being carried out on the vector processor (GPU) through the
back-end's ability to setup lanes for data, and conduct the necessary
operations on those lanes, as is dictated by the original program /
algorithm logic.

It would handle the vector setup (without changing any of the existing
algorithm cod (meaning it would still receive whatever parameters are
required), but instead of using that data directly, it would flatly
ignore it, and reference the arrays which were setup from a preceding
call to a "vector setup" function, which would determine the length of
the items to be processed in the array, the input data items, and where
to store output.

Then, and after that vector setup function in code, the normal call to
the conduct normal processing is made, which then, instead of processing
on the CPU on the single set of input parameters, actually ignores those
parameters completely, and references the
previously-setup-by-the-call-to-the-vector-setup-function values in the
vector engine.

In pseudo-code:

a) I create / load / have some arrays, say output, input1, input2, which
contain 64 items to process.
b) I call something like "gmp_vector_setup(output, input1, input2, 64)"
which sets up the "invisible variables" that were created / required by
the new back-end, ones that the vector engine uses internally.
c) Then I call something like "gmp_add(a, b, c)" to add all 64 input2
values to input1 values, storing the results in output variables.

Internally, after each series of operation are computed (as they would
be on the CPU), and some bit of logic is encountered to test some
condition, each lane is tested individually to determine its state, and
then the appropriate continuing operation for that lane is setup.  All
processing continues on all lanes, with each lane being "completed"
executing the equivalent of a NOP, until the last lane is completed.

The new vector back-end would be programmed to handle the multiple
branching conditions and operations based on the observed conditions of
each vector lane as they are revealed through real-world data
processing, just as the single-stream CPU algorithm would've been
before.

Using this approach, the logic of the original GMP algorithms would not
need to be changed (nor the logic of any other algorithms / programs
which exist out there in the world that could be vectorized like this),
an algorithm that works on the CPU would not need to be changed to work
on this kind of vector engine as it could be coded, tested and developed
on the single-stream approach, and then rolled out through the new
back-end compiler which would generate the vector equivalent of its
logic and the necessary other vector setup operations, and all of these
would be transparent to those algorithms, yet with the preceding
"vector_setup" call, would generate the desired results.  This approach
would be slower than if it were custom tailored to be a fully GPU-aware
algorithm, but faster than executing on the CPU alone, and it has the
added advantage of being generic enough to be applied to anything that
can be done in parallel without changing code.

In fact, I believe this approach could work for just about any type of
programming that also relates to multiple repeated computations on large
data sets.  For example, the next application of this ability I would
target, were it to be coded and completed and work properly, would be
MPFR as those functions would benefit greatly as well (at least they
would for my application).

This approach would also allow custom back-ends to be created for many
vector engines, including AMD, NVIDIA and even various flavor / model
revisions within, as well as additional, more generic multi-purpose
vector engines.

A lot of words.  I hope it's conveyed clearly.  Ask any any questions.

What do you think?

- Rick