ravenschade at googlemail.com
Mon Oct 11 12:41:51 CEST 2010
2010/10/11 Vincent Diepeveen <diep at xs4all.nl>:
>>> > I have an idea to approach SIMD differently for this kind of parallel
>>> > processing (using algorithms which are not inherently geared toward a
>>> > native port to SIMD). Is there someone on this list who would be
>>> > willing to discuss this possibility with me?
>>> Of course i'm very interested in all this.
>>> Parallelization using vector cores (not so much SIMD) is very
>>> The dominant hardware is currently AMD. It delivers 4.6+ Tflop
>>> (single precision) versus NVIDIA's 1+ Tflop.
>>> Most projects use CUDA, for reasons unknown to me.
>> CUDA probably because they have higher-level library support. AMD's CTM
>> is a little low-level for most.
> Some years ago you simply could not download CTM.
>> My thoughts on this idea are these, and forgive me because I have not
>> written these down formally, but have them off the top of my head:
>> Basically, create a back-end compiler that explicitly knows about CPU +
>> GPU implementation for the target (such as x86-64 / AMD), one that uses
>> a combination of CPU + GPU synchronization to conduct the vector
>> computing based on the regular C code rules for a single operation, but
>> applied to the logic necessary over multiple vector lanes.
> back-end compiler, pardon me?
> GPUs are very sensitive things. They are not very compatible.
> Some things they can do very fast if you program them at a very low
> level and know how to vectorize things.
> Everything else they suck at or simply cannot do.
> So only a few specific things are interesting to do inside a GPU.
> Generic compilers are very compatible things, my expertise is low level,
> not high level wasting cycles :)
>> Normally, the C code would be compiled into something the CPU would
>> serialize into the binary, at runtime it would execute, testing the
>> logic, performing the operations, etc.
>> In this new back-end, it would take the front-end parsed set of
>> instructions on what to execute, and how to conduct logic, and instead
>> of generating the output to conduct that logic to a single serial
>> instruction stream which would execute on the CPU, it would instead
>> generate the logic equivalent of what's being required as is necessary
>> for vector computing, meaning all logic at each stage interacts with the
>> CPU to determine what operations go in to each lane on successive
>> iterations based on whatever the natural numbers being computed
> A compiler for the GPU already exists, but it is secret AFAIK.
> I'm guessing it exists, as it would answer all questions why no one is
> making an effort to port some math functions to it.
> Yet a generic thing for a GPU is going to be very difficult to implement.
> AMD is the only interesting hardware there, as it is cheap and simply
> the fastest, and 10 years from now will still be the fastest.
> I wrote this during work hours, more text when at home.
>> This results in the original application / algorithm being written
>> entirely in C (at first, C++ possibly later), but with the actual
>> processing being carried out on the vector processor (GPU) through the
>> back-end's ability to setup lanes for data, and conduct the necessary
>> operations on those lanes, as is dictated by the original program /
>> algorithm logic.
>> It would handle the vector setup without changing any of the existing
>> algorithm code (meaning it would still receive whatever parameters are
>> required), but instead of using that data directly, it would flatly
>> ignore it, and reference the arrays which were setup from a preceding
>> call to a "vector setup" function, which would determine the length of
>> the items to be processed in the array, the input data items, and where
>> to store output.
>> Then, after that vector setup function in code, the normal call to
>> conduct the processing is made, which then, instead of processing
>> on the CPU on the single set of input parameters, actually ignores those
>> parameters completely, and references the
>> previously-setup-by-the-call-to-the-vector-setup-function values in the
>> vector engine.
>> In pseudo-code:
>> a) I create / load / have some arrays, say output, input1, input2, which
>> contain 64 items to process.
>> b) I call something like "gmp_vector_setup(output, input1, input2, 64)"
>> which sets up the "invisible variables" that were created / required by
>> the new back-end, ones that the vector engine uses internally.
>> c) Then I call something like "gmp_add(a, b, c)" to add all 64 input2
>> values to input1 values, storing the results in output variables.
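Steps a) to c) above can be sketched in plain C. This is only a minimal CPU-side illustration under stated assumptions: vec_ctx, vec_setup and vec_add are hypothetical names standing in for the "gmp_vector_setup" / "gmp_add" idea, they are not part of GMP, and 64-bit integers stand in for multi-precision values. The proposed back-end would generate the batch loop for a GPU instead of running it on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "invisible variables" recorded by the setup call.
 * Nothing here is a real GMP interface. */
typedef struct {
    uint64_t *out;        /* where results are stored        */
    const uint64_t *in1;  /* first operand of every item     */
    const uint64_t *in2;  /* second operand of every item    */
    size_t count;         /* number of items; 0 = no batch   */
} vec_ctx;

static vec_ctx g_ctx;     /* zero-initialized: no batch registered */

/* Register the arrays that the next vec_add() call should process. */
void vec_setup(uint64_t *out, const uint64_t *in1,
               const uint64_t *in2, size_t count)
{
    assert(count == 0 || (out && in1 && in2));
    g_ctx.out = out;
    g_ctx.in1 = in1;
    g_ctx.in2 = in2;
    g_ctx.count = count;
}

/* Keeps the scalar signature. With no batch registered it behaves like
 * the ordinary single-item add; with a batch registered it ignores its
 * arguments and processes the registered arrays instead. */
void vec_add(uint64_t *r, const uint64_t *a, const uint64_t *b)
{
    if (g_ctx.count == 0) {
        *r = *a + *b;     /* normal single-stream behaviour */
        return;
    }
    for (size_t i = 0; i < g_ctx.count; i++)
        g_ctx.out[i] = g_ctx.in1[i] + g_ctx.in2[i];
}
```

In this sketch, calling vec_setup(output, input1, input2, 64) followed by vec_add(a, b, c) fills all 64 entries of output and leaves a untouched. Calling vec_setup(NULL, NULL, NULL, 0) clears the batch so that vec_add goes back to its scalar behaviour.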
>> Internally, after each series of operations is computed (as it would
>> be on the CPU), and some bit of logic is encountered to test some
>> condition, each lane is tested individually to determine its state, and
>> then the appropriate continuing operation for that lane is set up. All
>> processing continues on all lanes, with each "completed" lane
>> executing the equivalent of a NOP, until the last lane is completed.
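The per-lane testing described above can be illustrated with a small lockstep loop in C. This is a sketch under assumptions, not the proposed back-end: LANES and lanes_gcd are hypothetical names, Euclid's GCD is chosen only because its iteration count depends on the data, and the inner for loop emulates on the CPU what a vector engine would do across its lanes at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* hypothetical lane count for the illustration */

/* Runs Euclid's algorithm on all lanes in lockstep. On return a[i]
 * holds gcd of the original a[i] and b[i]. The return value is the
 * number of lockstep iterations, i.e. the step count of the slowest
 * lane. */
unsigned lanes_gcd(uint32_t a[LANES], uint32_t b[LANES])
{
    assert(a != NULL && b != NULL);
    unsigned steps = 0;
    bool any_active = true;

    while (any_active) {
        any_active = false;
        for (size_t i = 0; i < LANES; i++) {
            /* Test this lane's condition individually. */
            bool active = (b[i] != 0);
            if (active) {
                uint32_t t = a[i] % b[i];
                a[i] = b[i];
                b[i] = t;
                any_active = true;
            }
            /* A completed lane does nothing here: the NOP equivalent. */
        }
        if (any_active)
            steps++;
    }
    return steps;
}
```

The cost model this makes visible is that the whole batch takes as many iterations as its slowest lane, so batches whose lanes finish at very different times spend much of their work on NOPs.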
>> The new vector back-end would be programmed to handle the multiple
>> branching conditions and operations based on the observed conditions of
>> each vector lane as they are revealed through real-world data
>> processing, just as the single-stream CPU algorithm would've been
>> Using this approach, the logic of the original GMP algorithms would not
>> need to be changed (nor the logic of any other algorithms / programs
>> which exist out there in the world that could be vectorized like this),
>> an algorithm that works on the CPU would not need to be changed to work
>> on this kind of vector engine as it could be coded, tested and developed
>> on the single-stream approach, and then rolled out through the new
>> back-end compiler which would generate the vector equivalent of its
>> logic and the necessary other vector setup operations, and all of these
>> would be transparent to those algorithms, yet with the preceding
>> "vector_setup" call, would generate the desired results. This approach
>> would be slower than if it were custom tailored to be a fully GPU-aware
>> algorithm, but faster than executing on the CPU alone, and it has the
>> added advantage of being generic enough to be applied to anything that
>> can be done in parallel without changing code.
>> In fact, I believe this approach could work for just about any type of
>> programming that also relates to multiple repeated computations on large
>> data sets. For example, the next application of this ability I would
>> target, were it to be coded and completed and work properly, would be
>> MPFR as those functions would benefit greatly as well (at least they
>> would for my application).
>> This approach would also allow custom back-ends to be created for many
>> vector engines, including AMD, NVIDIA and even various flavors / model
>> revisions within, as well as additional, more generic multi-purpose
>> vector engines.
>> A lot of words. I hope it's conveyed clearly. Ask any questions.
>> What do you think?
>> - Rick
I'm not really sure which kind of computations GMP uses, but in my
experience matrix computations and FFTs can be done really well on
GPUs. This is why CUDA is used in most projects. It provides cuBLAS
and cuFFT, so that you don't have to care about the low-level CUDA
implementation and can just use the CUDA kernels in your normal
For example, it is relatively straightforward to change
LAPACK code to run on an NVIDIA GPU.