Any interest in multi-threaded code?

Wed Apr 4 20:25:38 UTC 2018

Presumably, that idea will work.
But based on my experience with multi-threading,
because of various overheads and latencies involved,
you have to make sure that there is enough work to do per
thread to make it worthwhile.  Parallelizing the inner loops
like this may or may not yield the expected speedup.
One would have to try to and see...my intuition and experience
makes me somewhat skeptical, but I'd be happy to be proved wrong.

> On Apr 4, 2018, at 2:56 PM, Richard Foster <r-foster at gmx.com> wrote:
> 
> Hi
> 
>> It is probably not useful to work much on the current FFT code.  We have
>> long planned--and worked quite a bit on--replacement FFT code with much
>> better memory access pattern.
>> 
>> The replacement code works with multiple small coefficient fields
>> instead of a large ring.  It also implements Bailey's multidimensional
>> FFT trick for memory locality improvements.
>> 
>> I have forgotten the exact state of the code, but I am sure Nisse can
>> fill me in.
>> 
>> Any parallel work should go in that code, but only after it is
>> well-tuned for a single thread, I think.
>> 
>> Making GMP scale well for multiple thread is highly non-trivial.  Only
>> superlinear algorithm should even be attempted.  And then, one will
>> want to make it as course grain as possible, perhaps even with some
>> cache topology understanding built in to the code.
>> 
>> Generating data in thread A which is then processed in thread B will
>> typically mean that data needs to be transferred from one cache to
>> another.  That is super expensive, and will offset any benefit untless
>> the comutation before or after the data transfer is huge.
> The basic idea I've been looking at is not to run the FFT itself in parallel, but to parallelise the inner loop. The algorithm recurses until the operand sizes are below a suitable size, then loops K times; each iteration multiplies 2 numbers of n limbs.
> Naively, if you have m CPUs, then you should get a speedup if each CPU multiplies 2 numbers of n/m limbs. There will be a total of mK multiplications, but since each is smaller, it should go faster.
> As Torbjörn says, cache usage is key: but within some uncertainty, the cache requirement should be roughly the same. 2m operands of n/m limbs should need roughly the same total space as 2 operands of n limbs, and the code should take up roughly the same space; at this point in the algorithm all the operands are the same size, so the GMP code will pick the same method for all the iterations (schoolbook, Toom, etc), so there will only be 1 copy of the code in cache - assuming it's re-entrant, as all modern code should be. There is an overhead of a stack per thread, which must be checked out.
> The other assumption is that the forward FFT leaves the operands in a suitable state, ie, each operand is in contiguous memory. This is an advantage anyway, even if you're not running in parallel, so I think it's a reasonable assumption. Also, since each of the smaller operands is a piece of the original, cache locality for the smaller multiplications is no worse than before.
> Finally, the multiplications must leave their results in a suitable state for the inverse FFT. Again, I think this is a reasonable assumption, as the non-parallel version of the loop does the same thing as the parallel version, just with larger operands.
> 
> Does the reasoning appear OK, or am I totally on the wrong track?
> 
> Even it looks OK in today's code, does Nisse's new FFT code change anything?
> 
> Regards
> Richard Foster
> 
> 
> _______________________________________________
> gmp-devel mailing list
> gmp-devel at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-devel