Any interest in multi-threaded code?

Victor Shoup shoup at
Wed Apr 4 19:58:05 UTC 2018

Hi all,

Since this FFT code is sitting around, I wonder if there's a chance I could adapt it to NTL. I've actually implemented David Harvey's ideas a few years ago (and his work is actually based on older NTL logic which I never really publicized).  NTL currently only uses special types like __uint128_t (or whatever it's called) or a few inline asm code sequences. What's missing is logic to keep things in the L1 cache, so there is definitely much room for improvement. 

I've planned on doing these improvements myself for some time, but haven't yet found the time.  If I could start from an existing code base and adapt it, that would probably speed things up. Hopefully, as long as everything is LGPL compatible and nobody minds me adapting their code, this would work out. 

So I guess I'm just asking where the FFT code is at and if there would be any objections to my using it in that way. 

BTW, I've done the multithreaded implementation of this type of thing. It works great, although in my setting, the number of primes is usually larger (like a dozen or more).


> On Apr 4, 2018, at 3:41 PM, Niels Möller <nisse at> wrote:
> "Richard Foster" <r-foster at> writes:
>> Even it looks OK in today's code, does Nisse's new FFT code change
>> anything?
> The small-prime FFT code uses fft-based multiplication of polynomials
> over Z_p for a couple of single-limb primes p. Usually in the range from
> 3 to 5 primes, depending on how the input size fits constraints on transform
> size, and it's probably beneficial to go a bit beyond 5.
> The FFTs are organized to spend most of the time doing small FFTs
> fitting the L1 cache and implemented as tight assembly loops, and on top
> of that somewhat hairy logic doing twiddle factors, data transpositions,
> truncated fft logic, and the like.
> Regarding threading, the best strategy will likely differ depending on
> the number of cores you target. For a small number of cores, it seems
> reasonable to let one core handle each p, doing all of forward
> transforms, pointwise multiply, and inverse transforms. They would then
> work independently, with only an initial pass to distrivute the mod p
> work to threads, and a final pass to combine mod p results using crt and
> assemble the final product. I'm not that familiar with threading, but
> I'd expect a reasonable speedup for large operands, aiming to have each
> core mainly exercise its local L1 cache. If cores end up competing for
> available cache or memory bandwidth, gain will be smaller.
> It's also possible to try to spawn worker threads to do the smallest mod
> p FFTs, but as Torbjörn says, there's a risk that lots of data has to be
> passed from one core to another. So I'd first try to split computation
> work between threads in much larger pieces than a single small FFT at a
> time.
> As for performance, the assembly loops when I worked on this (which is
> close to 10(!) years ago) ran at some 8-10 cycles per butterfly, on the
> AMD Opteron which was high end at the time.
> I'm aware of a trick by David Harvey, who got running time down to only
> 6 cycles per butterfly (totally saturating the multiplication units) on
> the same hardware, by using primes close to 2^63 rather than close to
> 2^64.
> Regards,
> /Niels
> -- 
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.
> _______________________________________________
> gmp-devel mailing list
> gmp-devel at

More information about the gmp-devel mailing list