Best way to carry on 2-input architecture?

Niels Möller nisse at
Wed Aug 20 19:39:53 UTC 2014

"Wesley W. Terpstra" <wesley at> writes:

> I'm going to retract my out-of-hand rejection of this idea.

Depends on the size, I guess. For 64x64, my example scheme would reduce
the number of and gates from 4096 to about half. Maybe that's not worth
the effort, since there's also going to be quite a lot of addiition work.

But if you want a larger multiply, say 1024 x 1024, then with a 32-bit
base case, you reduce the and gates from 1 millon to 250 000. I'm not
saying that scheme really makes sense, it might be better to use, e.g, a
circuit for 64x1024, and iterate.

> I can't visualize how the summands get recombined. It's not as simple
> as adding the results using a Wallace tree, though, because you have
> common sub-expressions that you can't destroy or you lost the benefit.

I haven't thought this through, and in particular I have no idea how to
handle carry propagation between blocks correctly. But I would expect that you
first split to get a vector of 8 elements, 8-bits each. Then multiply by
an 27x8 matrix with elements mostly ones and zeros. And after the
pointwise multiply, multiply by another 14x27 matrix with simple
elements, and then add pieces together at the appropriate positions.

> What I can visualize, and think fits well to hardware, is an NTT-based
> algorithm. Widen all the bits by log(N), do an NTT to both inputs,
> multiply, and another NTT. This would be a good fit for hardware
> beyond some threshold.

An fft of size 2^k should definitely give a cleaner structure than k
levels of expanded Karatsuba multiplication. A related problem is
hardware circuits for wraparound multiplication, i.e., a*b mod (2^n ±


Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

More information about the gmp-devel mailing list