Yes, I'm familiar with David Harvey's ideas. I've implemented them in NTL, but not completely in assembly. 

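30-bit primes may indeed be too small for your application. With the integer IFMA52 instructions you are stuck at 52 bits, or just 50 if you want to use David's tricks, since his lazy butterflies keep intermediate values in [0, 4p) and those still have to fit in a 52-bit lane.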
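To make the 52-versus-50 point concrete, here is a rough scalar sketch of a Harvey-style lazy butterfly with the word size set to 2^52. It is plain Python, not IFMA code, and it is not what NTL does internally. The names `precompute` and `lazy_butterfly` are made up for illustration. The shift and mask only mimic the high and low 52-bit product halves that the IFMA52 multiplies give you.

```python
BETA_BITS = 52            # assumed lane width, matching IFMA52
BETA = 1 << BETA_BITS

def precompute(w, p):
    """Scaled twiddle W' = floor(W * beta / p) (hypothetical helper)."""
    return (w * BETA) // p

def lazy_butterfly(x, y, w, w_scaled, p):
    """Inputs x, y in [0, 4p), twiddle w in [0, p).

    Returns (x + w*y, x - w*y) mod p, each left unreduced in [0, 4p).
    """
    # Values up to 4p - 1 must fit in one lane, hence p < 2^50.
    assert p < BETA // 4
    if x >= 2 * p:
        x -= 2 * p                        # x is now in [0, 2p)
    q = (w_scaled * y) >> BETA_BITS       # high half of W' * y
    # Only the low 52 bits of each product are needed here;
    # the true difference already lies in [0, 2p).
    t = (w * y - q * p) & (BETA - 1)
    return x + t, x - t + 2 * p

# Usage with an arbitrary modulus just below 2^50
# (the identity itself does not need p to be prime).
p = (1 << 50) - 27
w = 123456789
a, b = lazy_butterfly(3 * p + 5, 4 * p - 1, w, precompute(w, p), p)
```

The outputs are only correct modulo p and stay in [0, 4p), so a final pass is needed to bring them back to [0, p). That two bits of slack is where the 50 comes from.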

Also, Google "Mathemagix SIMD". They have a paper with some nice ideas for SIMD NTTs.

