Best way to carry on 2-input architecture?

Fri Aug 22 14:33:56 UTC 2014

"Wesley W. Terpstra" <wesley at terpstra.ca> writes:

> I was not proposing it for 64x64!!!
> I have no idea when NTT would be more efficient.
> Maybe like 2048x2048 or something.

At least in software, the threshold where FFT based multiply becomes
useful is a lot lower for wraparound mul than for regular mul.

> The gate depth of NTT will be 2*log(N)*C where C is the cost of doing
> GF(N) arithmetic in the butterfly stages. Let's guess C is like 8. Oh,
> you also need a small Wallace tree and a KS adder to recombine the
> final carries too.

I think one would have a better chance working with coefficients in the
ring Z/(2^k+1), than in field Z/p. Then 2 is a primitive (k+1)th root
root (2^k = -1), and the butterflies will need only an add and a shift.

But even if you reduce the butterflies to shift and add, having multiple
stages of additions, dependencies will make carry propagation latencies
add up.

For concreteness, say we attempt to use a size 16 FFT, over the
cofficient ring 2^{12}+1 (it's some years since I last looked carefully
at FFT-based multiply, but I hope those parameters make sense. The FFT
inputs are 4 bits, but grow to 12 in the output convolution). Then one
could either use 4 stages of dependant butterflies, or skip the "fast"
in the FFT and compute each FFT coefficient using a depth 16 Wallace
tree, producing the right linear combination of shifted input elements.

And one has to be able to do two transforms, and 12-bit wraparound
pointwise muls, all with latency beating the Wallace tree for a plain
64x64 multiply. It seems a bit unlikely to work out in an efficient way.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.