New n log n algorithm for high-precision multiplication

Mon Apr 15 12:23:22 UTC 2019

       Dear Niels,

> From: nisse at lysator.liu.se (Niels Möller)
> Date: Mon, 15 Apr 2019 14:02:28 +0200
> 
> paul zimmermann <Paul.Zimmermann at inria.fr> writes:
> 
> > Schönhage-Strassen can be implemented with good cache locality:
> >
> > https://hal.inria.fr/inria-00126462
> 
> Thanks! As I read it, for large inputs, the top level transforms operate
> on coefficients that fit in L2 cache but not L1 cache, and with several
> passes over the data, depending on fft size. Is that right?

not quite. In Section 2.2.3 (Bailey's 4-step transform) we describe a way
to please both the L2 and L1 caches. Of course this work should be revisited
with modern processors.

> Then I think small-prime fft has potential for better locality, with few
> complete passes of the data and all the heavy fft work operating on data
> in the L1 cache.

maybe, I'm curious to see a comparison with our code!

Paul