VPMADD52

Mon Oct 12 14:43:43 UTC 2015

Well, that's impressive! Is this the same code that you had 2-3 years ago, or has it evolved since then? I think the inner loop of my code is pretty much the same as what you had then. If it's changed, is the new code available? 

David Harvey <d.harvey at unsw.edu.au> wrote:
>
>On 12 Oct 2015, at 9:29 am, Niels Möller <nisse at lysator.liu.se> wrote:
>
>> Victor Shoup <shoup at cs.nyu.edu> writes:
>> 
>>> There is a complement of " blend" instructions which should do the
>job.
>> 
>> I'll have to look that up.
>> 
>>> For FFTs, it might also make sense to stick to 30 bit primes...
>> 
>> The best code I'm aware of is David Harvey's. It was for the AMD
>> Opteron, which issues one multiply every second cycle, and the code
>did
>> one butterfly in 6 cycles (i.e., 100% multiplier utilization, using
>the
>> trick we attribute to you, and some additional tricks). It used
>primes
>> of 62 bits, iirc. So that's what any SIMD fft code has to beat.
>> 
>> See
>http://web.maths.unsw.edu.au/~davidharvey/talks/fastntt-2-talk.pdf
>
>one quick updated data point, using my current code, which is based on
>the same ideas. This is mainly C code, with inline assembly for the
>64*64 multiply. On my laptop (which I believe has an intel 4650U chip)
>the sweet spot is transform length 2^11, where it averages about 2.8
>cycles per butterfly. These are 62-bit primes.
>
>I believe this chip can manage one multiply per cycle (?), so it should
>be 3 cycles per butterfly at best... but there is no contradiction, as
>this profile also includes a bunch of multiply-free butterflies at the
>bottom levels.
>
>david

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.