VPMADD52

Niels Möller nisse at lysator.liu.se
Mon Oct 12 13:29:58 UTC 2015


Victor Shoup <shoup at cs.nyu.edu> writes:

> There is a complement of " blend" instructions which should do the job.

I'll have to look that up.

> For FFTs, it might also make sense to stick to 30 bit primes...

The best code I'm aware of is David Harvey's. It was for the AMD
Opteron, which issues one multiply every second cycle, and the code did
one butterfly in 6 cycles (i.e., 100% multiplier utilization, using the
trick we attribute to you, and some additional tricks). It used primes
of 62 bits, iirc. So that's what any SIMD fft code has to beat.

See http://web.maths.unsw.edu.au/~davidharvey/talks/fastntt-2-talk.pdf

My old code uses 63-bit primes (or possibly 64, I don't remember?), and
needed 9 cycles.

Furthermore, I think one objective for small-prime FFT code in GMP is to
be able to multiply numbers so large that a transform length beyond
32-bits is needed, and then 32-bit primes are too small.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


More information about the gmp-devel mailing list