Best way to carry on 2-input architecture?

Wesley W. Terpstra wesley at
Wed Aug 20 14:20:20 UTC 2014

On Wed, Aug 20, 2014 at 10:59 AM, Niels Möller <nisse at> wrote:
> I've been thinking that if you want the full two-word result of a
> multiply, and compute each part separately using mulhi and mullo, the
> mulhi operation will have to compute the full product anyway, discarding
> the low half, and then the mullo will redo the same computation. This
> seems redundant

Yes, it is.

This is why some architectures output the low+high at the same time.
However, that means instructions can potentially have two outputs,
which increases the already-quadratic execution-unit-bypass cost. IMO,
trying to output both halves at once is a bad idea. Bypass ports are
more expensive than multiplier units.
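To make the redundancy concrete, here is a toy model (plain Python, not any real ISA) of separate mulhi/mullo instructions: both need the full double-width product internally, and each discards half of it.

```python
# Toy model of separate mulhi/mullo: each forms the full 128-bit
# product internally and throws half of it away.
BITS = 64
MASK = (1 << BITS) - 1

def mullo(a, b):
    # Low half of the product; the high half is discarded.
    return (a * b) & MASK

def mulhi(a, b):
    # High half; the full product must still be formed first --
    # the redundant work discussed above.
    return ((a & MASK) * (b & MASK)) >> BITS

a, b = 0xDEADBEEFDEADBEEF, 0x0123456789ABCDEF
assert (mulhi(a, b) << BITS) | mullo(a, b) == a * b
```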

> But maybe you can do a circuit for mulhi which is significantly simpler
> than for a widening multiplication, because you only need to collect the
> carry out from the low partial products? Something like a Wallace tree
> where you drop the low output bit from all the half and full adders of
> the low product half.

If you figure out a way to do this, please let me know.

Personally, I don't see how this could be done. The Wallace tree can
never reduce you to fewer than two rows, so you still have to do the
full-width addition of the low half just to recover its carry out.
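A toy sketch of why the tree bottoms out at two rows: each 3:2 (carry-save) step turns three rows into two without propagating carries, so a carry-propagating add over the surviving sum and carry rows is still required at the end.

```python
# Toy 3:2 (carry-save) reduction: three rows in, two rows out,
# no carry propagation inside the tree.
def compress_3to2(a, b, c):
    s = a ^ b ^ c                                  # bitwise sum
    carry = ((a & b) | (a & c) | (b & c)) << 1     # saved carries
    return s, carry

def wallace_reduce(rows):
    while len(rows) > 2:
        s, carry = compress_3to2(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, carry]
    return rows                                    # always two rows

# Partial products of an 8x8 multiply, one row per bit of y.
x, y = 0xB7, 0x5D
rows = [(x << i) if (y >> i) & 1 else 0 for i in range(8)]
s, c = wallace_reduce(rows)
assert s + c == x * y    # the final carry-propagating add is unavoidable
```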

> And maybe you would want to implement some sort of carry-lookahead
> anyway, to reduce latency, so even a widening multiplication would use
> mostly independent circuits for the low and high half?

Don't forget that ASICs do not use ripple adders. They use Kogge-Stone
adders, or a variation with similar circuit depth, but less area. So
carry-lookahead is crap compared to what they use for the final sum.
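For illustration, here is a bit-level toy model (Python, obviously not hardware) of a Kogge-Stone add: all carries are resolved in log2(n) parallel-prefix levels over generate/propagate signals, instead of rippling bit by bit.

```python
# Toy Kogge-Stone (parallel-prefix) adder: log2(bits) prefix levels
# combine per-bit generate/propagate signals to resolve every carry.
def kogge_stone_add(a, b, bits=64):
    g = [((a >> i) & (b >> i)) & 1 for i in range(bits)]   # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(bits)]   # propagate
    d = 1
    while d < bits:                        # log2(bits) levels
        g, p = ([g[i] | (p[i] & g[i - d]) if i >= d else g[i]
                 for i in range(bits)],
                [p[i] & p[i - d] if i >= d else p[i]
                 for i in range(bits)])
        d <<= 1
    carry = [0] + g[:bits - 1]             # carry into bit i is G[i-1]
    s = 0
    for i in range(bits):
        s |= ((((a >> i) ^ (b >> i)) & 1) ^ carry[i]) << i
    return s                               # carry-out is dropped

assert kogge_stone_add(12345, 67890) == 80235
```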

> Second question is on how large multipliers are implemented. How large
> multipliers is it practical to build using a single cycle combinatorial
> Wallace tree? How does one organize a 64x64 multiplier, which seems too
> large to do in one cycle?

These are all questions for an ASIC designer. I work with FPGAs.

However, I think your question exposes some confusion. It's not how
large the multiplier is that matters. You must pay the full area cost
for the multiplier no matter how many cycles it takes. In fact,
increasing the latency/cycles ADDS area cost. [There are ways to
reduce the area, but if you want to be able to issue one multiply
every cycle (like a CPU), those techniques are unavailable.]

So, for a 64x64 multiplier, the circuit starts with 1 level of 64x64
AND gates, followed by 8 levels of 3:2 Wallace reduction, followed by
6 levels of a KS adder.
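As a sanity check on reduction depth, the level count can be computed directly; note the exact number depends on the compressor mix (pure 3:2 versus the 4:2 compressors real designs often use), which is why quoted depths vary.

```python
# Count reduction levels: each level turns groups of `frm` rows
# into `to` rows, leftovers pass through unchanged.
def reduction_levels(rows, frm=3, to=2):
    levels = 0
    while rows > 2:
        rows = (rows // frm) * to + rows % frm
        levels += 1
    return levels

assert reduction_levels(64) == 10        # pure 3:2 compressors
assert reduction_levels(64, 4, 2) == 5   # 4:2 compressors are shallower
```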

That circuit has some technology-dependent latency from start to
finish. Let's say it takes 1 nanosecond. That means it can run at 1GHz
with a 1-cycle latency. However, if you put a row of registers in the
middle, the delay per cycle drops to half the original plus the
register's overhead. So maybe you get 1.8GHz with a 2-cycle latency.
The latency of the multiplier is a trade-off you make to increase the
maximum frequency of the device. In all cases, the multiplier hardware
is identical; only the position and number of registers in the circuit
change.
> One could build it out of smaller blocks, say we have combinatorial 8x8
> multipliers. We then need 64 of those, in sequence or parallel, and a
> bunch of adders. Or we could use Karatsuba, to do a 64x64 using 27 8x8
> multiplies and a bunch of additions.

Those are software techniques. You would not do that in an ASIC.
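For reference, the 27-multiply count in the quote falls out of three levels of Karatsuba recursion (3^3 = 27). A minimal Python sketch, counting base multiplies:

```python
# Karatsuba from 64-bit operands down to an 8x8 base case.
# Returns (product, number of base multiplies performed).
def karatsuba(a, b, bits):
    if bits <= 8:
        return a * b, 1                # stands in for an 8x8 primitive
    half = bits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    z0, n0 = karatsuba(a0, b0, half)
    z2, n2 = karatsuba(a1, b1, half)
    # The middle operands can be one bit wider than `half`; a real
    # fixed-width design must handle that extra bit -- ignored here.
    zm, nm = karatsuba(a0 + a1, b0 + b1, half)
    z1 = zm - z0 - z2
    return z0 + (z1 << half) + (z2 << bits), n0 + n2 + nm

x, y = 0xFEDCBA9876543210, 0x0123456789ABCDEF
product, calls = karatsuba(x, y, 64)
assert product == x * y and calls == 27
```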

You *might* do this in an FPGA, because we have so-called "hard IP"
blocks that are basically fixed-function multipliers of a specific
size. So, the Arria V series has hardware to do 27x27 multiplies that
I can use as building blocks.

For reference, here is the implementation I wrote for multipliers on
an FPGA: <>
This multiplier has a latency of 2-4 cycles, depending on whether or
not I enable some registers to improve the circuit performance.

I don't do Karatsuba or anything like that, because, ... well, it's a
bad idea. :-) I use a Wallace tree to reduce the results of the (27x27
or whatever) DSP blocks down to two or three rows, and then I use the
FPGA's dedicated adder logic.

> Does that make any sense?

No. You would build it as I outlined above. 64x64 AND gates, Wallace
reduction, and a log-deep adder. The only variation is where you put
the registers.

If you were building a truly huge multiplier, like 256x256 or bigger,
then you would use a state-machine which drives and reuses primitive
multiplier units built like above. Such a state-machine would never be
used in a CPU, but might be used in a custom crypto chip or something.
A CPU would just run a program that did the same thing.
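The state-machine idea can be sketched in software terms: sequence a wide multiply as schoolbook limb products, reusing one 64x64 primitive per step (a real design would also pipeline the accumulation; this is only the control-flow shape).

```python
# Schoolbook 256x256 multiply sequenced over a single 64x64 primitive,
# the way a state machine would drive one hardware multiplier unit.
LIMB = 64
MASK = (1 << LIMB) - 1

def mul64(a, b):
    return a * b                       # stands in for the hardware unit

def wide_mul(a, b, limbs=4):           # 4 limbs of 64 bits = 256x256
    acc = 0
    for i in range(limbs):
        for j in range(limbs):
            ai = (a >> (LIMB * i)) & MASK
            bj = (b >> (LIMB * j)) & MASK
            acc += mul64(ai, bj) << (LIMB * (i + j))
    return acc

x = (0xDEADBEEF << 220) | 0x12345
y = (1 << 255) | 0xCAFEBABE
assert wide_mul(x, y) == x * y
```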

More information about the gmp-devel mailing list