d.harvey at unsw.edu.au
Mon Oct 12 14:15:54 UTC 2015
On 12 Oct 2015, at 9:29 am, Niels Möller <nisse at lysator.liu.se> wrote:
> Victor Shoup <shoup at cs.nyu.edu> writes:
>> There is a complement of " blend" instructions which should do the job.
> I'll have to look that up.
>> For FFTs, it might also make sense to stick to 30 bit primes...
> The best code I'm aware of is David Harvey's. It was for the AMD
> Opteron, which issues one multiply every second cycle, and the code did
> one butterfly in 6 cycles (i.e., 100% multiplier utilization, using the
> trick we attribute to you, and some additional tricks). It used primes
> of 62 bits, iirc. So that's what any SIMD fft code has to beat.
> See http://web.maths.unsw.edu.au/~davidharvey/talks/fastntt-2-talk.pdf
one quick updated data point, using my current code, which is based on the same ideas. This is mainly C code, with inline assembly for the 64*64 multiply. On my laptop (which I believe has an intel 4650U chip) the sweet spot is transform length 2^11, where it averages about 2.8 cycles per butterfly. These are 62-bit primes.
I believe this chip can manage one multiply per cycle (?), so it should be 3 cycles per butterfly at best... but there is no contradiction, as this profile also includes a bunch of multiply-free butterflies at the bottom levels.
More information about the gmp-devel