# Best way to carry on 2-input architecture?

Niels Möller nisse at lysator.liu.se
Fri Aug 22 14:33:56 UTC 2014

```"Wesley W. Terpstra" <wesley at terpstra.ca> writes:

> I was not proposing it for 64x64!!!
> I have no idea when NTT would be more efficient.
> Maybe like 2048x2048 or something.

At least in software, the threshold where FFT based multiply becomes
useful is a lot lower for wraparound mul than for regular mul.

> The gate depth of NTT will be 2*log(N)*C where C is the cost of doing
> GF(N) arithmetic in the butterfly stages. Let's guess C is like 8. Oh,
> you also need a small Wallace tree and a KS adder to recombine the
> final carries too.

I think one would have a better chance working with coefficients in the
ring Z/(2^k+1), than in field Z/p. Then 2 is a primitive (k+1)th root
root (2^k = -1), and the butterflies will need only an add and a shift.

But even if you reduce the butterflies to shift and add, having multiple
stages of additions, dependencies will make carry propagation latencies

For concreteness, say we attempt to use a size 16 FFT, over the
cofficient ring 2^{12}+1 (it's some years since I last looked carefully
at FFT-based multiply, but I hope those parameters make sense. The FFT
inputs are 4 bits, but grow to 12 in the output convolution). Then one
could either use 4 stages of dependant butterflies, or skip the "fast"
in the FFT and compute each FFT coefficient using a depth 16 Wallace
tree, producing the right linear combination of shifted input elements.

And one has to be able to do two transforms, and 12-bit wraparound
pointwise muls, all with latency beating the Wallace tree for a plain
64x64 multiply. It seems a bit unlikely to work out in an efficient way.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
```