FFT improvements

Tue Jul 18 18:51:41 CEST 2006

hi all,

With Joel Veness i'm working on some FFT code. We happen to see that GMP is 
using the same method nowadays in version 4.2.
When we've got it to work low level optimized here we'll do a backport of 
that to GMP, as that should speed it up quite some in 64 bits.

Right now it seems for big number ranges > several hundreds of millions of 
bits, current implementation can use only 6 bits a limb (above
that it gets less accurate). We're working on an implementation that's 
either 64 bits or 2 x 64 bits that should give either 15 to 16 bits a
limb or 45 bits a limb or so.

After that parallellization of it. Yet my own hardware has only 4 cores. 
That's a testlimitation.

Our main intend is also to let our library work under windows (which it 
already is doing in a previous incarnation written by Joel just factor 10
slower code or so than the low level i'll try to optimize it for), as the 
visual c++ 2005 is a superior compiler over gcc.

Of course the big problem is that visual c++ 2005 uses intrinsics versus gcc 
some weirdo assembly language incarnation.

So my plan now is to rewrite our library to using the same datatype 
definition like GMP has. That'll take time.

The initial speedup of the FFT code for the mpz_t range (or mpn) at x86-64 
i'm not so sure of, as some will depend how much the ILP (instruction level 
parallellism) of the k8 can handle here.

Perhaps mixing the calculation of 2 limbs times 2 timbs modulo P each, gives 
a bigger ILP at low level.
k8 can of course execute 3 instructions a cycle and the new generation intel 
should be capable of doing 4 a cycle and i would prefer to exploit that.

the speedup largely depends upon such issues however. I'd estimate a factor 
2 at least or so for the hundreds of millions of bitrange.

Feel free to write to us.

Especially: how much wanted is a parallel implementation of GMP?

I've parallellized quite something already at some 512 processor Origins, 
but that was basically search of my chessprogram, which is quite harder
to parallellize than FFT. I can think of some ways to parallellize but 
reinventing the wheel sounds pretty dumb there.

Any thoughts on most efficient way to parallellize big number FFT's?

Thanks,
Vincent
diep at xs4all.nl
skype : diepchess