Any interest in multi-threaded code?
Niels Möller
nisse at lysator.liu.se
Wed Apr 4 19:41:38 UTC 2018
"Richard Foster" <r-foster at gmx.com> writes:
> Even it looks OK in today's code, does Nisse's new FFT code change
> anything?
The small-prime FFT code uses fft-based multiplication of polynomials
over Z_p for a couple of single-limb primes p. Usually in the range from
3 to 5 primes, depending on how the input size fits constraints on transform
size, and it's probably beneficial to go a bit beyond 5.
The FFTs are organized to spend most of the time doing small FFTs
fitting the L1 cache and implemented as tight assembly loops, and on top
of that somewhat hairy logic doing twiddle factors, data transpositions,
truncated fft logic, and the like.
Regarding threading, the best strategy will likely differ depending on
the number of cores you target. For a small number of cores, it seems
reasonable to let one core handle each p, doing all of forward
transforms, pointwise multiply, and inverse transforms. They would then
work independently, with only an initial pass to distrivute the mod p
work to threads, and a final pass to combine mod p results using crt and
assemble the final product. I'm not that familiar with threading, but
I'd expect a reasonable speedup for large operands, aiming to have each
core mainly exercise its local L1 cache. If cores end up competing for
available cache or memory bandwidth, gain will be smaller.
It's also possible to try to spawn worker threads to do the smallest mod
p FFTs, but as Torbjörn says, there's a risk that lots of data has to be
passed from one core to another. So I'd first try to split computation
work between threads in much larger pieces than a single small FFT at a
time.
As for performance, the assembly loops when I worked on this (which is
close to 10(!) years ago) ran at some 8-10 cycles per butterfly, on the
AMD Opteron which was high end at the time.
I'm aware of a trick by David Harvey, who got running time down to only
6 cycles per butterfly (totally saturating the multiplication units) on
the same hardware, by using primes close to 2^63 rather than close to
2^64.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list