VPMADD52

Mon Oct 12 13:54:03 UTC 2015

Yes, I'm familiar with David Harvey's ideas. I've implemented them in NTL, but not completely in assembly. 

30 bit primes may indeed be too small for your application. With the integer ifma52 instructions, you are stuck at 52 bits... Or just 50 if you want to use David's tricks. 

Also, Google mathemagix simd... They have a paper that has some nice ideas for simd ntt's

nisse at lysator.liu.se wrote:
>Victor Shoup <shoup at cs.nyu.edu> writes:
>
>> There is a complement of " blend" instructions which should do the
>job.
>
>I'll have to look that up.
>
>> For FFTs, it might also make sense to stick to 30 bit primes...
>
>The best code I'm aware of is David Harvey's. It was for the AMD
>Opteron, which issues one multiply every second cycle, and the code did
>one butterfly in 6 cycles (i.e., 100% multiplier utilization, using the
>trick we attribute to you, and some additional tricks). It used primes
>of 62 bits, iirc. So that's what any SIMD fft code has to beat.
>
>See http://web.maths.unsw.edu.au/~davidharvey/talks/fastntt-2-talk.pdf
>
>My old code uses 63-bit primes (or possibly 64, I don't remember?), and
>needed 9 cycles.
>
>Furthermore, I think one objective for small-prime FFT code in GMP is
>to
>be able to multiply numbers so large that a transform length beyond
>32-bits is needed, and then 32-bit primes are too small.
>
>Regards,
>/Niels
>
>-- 
>Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
>Internet email is subject to wholesale government surveillance.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.