Best way to carry on 2-input architecture?

Wesley W. Terpstra wesley at
Wed Aug 20 16:49:08 UTC 2014

On Wed, Aug 20, 2014 at 10:59 AM, Niels Möller <nisse at> wrote:
> One could build it out of smaller blocks, say we have combinatorial 8x8
> multipliers. We then need 64 of those, in sequence or parallel, and a
> bunch of adders. Or we could use Karatsuba, to do a 64x64 using 27 8x8
> multiplies and a bunch of additions.

I'm going to retract my out-of-hand rejection of this idea.

> So one possibility might be a three-stage pipe-line, where the first
> stage exands the two 64-bit inputs into 27 8-bit values each (well, they
> grow a little; we either have to manage additional sign bits, or to work
> with unsigned only, the largest pieces will grow to 11 bits). Second
> stage would do 27 smaller multiplies in parallel. And the third and
> final stage would combine those 27 products to the complete 128 bit
> product.

I can't visualize how the summands get recombined. It's not as simple
as adding the results using a Wallace tree, though, because you have
common sub-expressions that you can't destroy or you lost the benefit.

What I can visualize, and think fits well to hardware, is an NTT-based
algorithm. Widen all the bits by log(N), do an NTT to both inputs,
multiply, and another NTT. This would be a good fit for hardware
beyond some threshold.

More information about the gmp-devel mailing list