Best way to carry on 2-input architecture?

Wesley W. Terpstra wesley at
Fri Aug 22 09:46:15 UTC 2014

On Thu, Aug 21, 2014 at 11:12 PM, Niels Möller <nisse at> wrote:
> One *can* reduce a piece at the right end to a single row

... yeah, but Wallace trees are essentially ripple adders at the
right-end = slow. So you don't use them for that. Maybe after you get
it down to 3 rows for all columns (ie: the middle) you happen to have
saved a couple bits on the right side, but that's not much, and as you
note, you get the same savings to all of mulhi, fmahi, mullo, and

> On the other hand, mullo
> should be significantly cheaper, with half the number of bit products,
> all carries from the leftmost column in the Wallace tree ignored, and a
> final adder of half the length.

I agree it takes half the area, but in the end you only save 1 level
from the KS trees. So, half the area and ~4% shorter total latency.

> So what should one do if one have separate mullo and mulhi instructions
> and no widening multiply? I guess the common way is to have a circuit
> for a full widening multiplier, where mulhi simply discards the low
> half.

That is what I think is most sensible. The last stage of the
multiplier has a mux that picks the low or high half, depending on a

>  And one could either use the same circuit for mullo (but getting
> the result a bit earlier) or a separate mullo circuit in some other
> execution unit.

Nah. You use the same clock for the whole ALU, so your clock rate is
limited by the same critical path for both circuits. Thus, mullo is
actually no faster, despite being potentially two gates shorter.

> If one uses the same circuit for both mulhi and mullo,
> maybe one can arrange in the the decoder so that a mullo instruction,
> immediately after a mulhi, and with the same operands, can piggyback on
> the mulhi instruction.

You forget that the EU which contains the multiplier only has a single
output port to the bypass network. So even if the issue stage somehow
found a matched pair of mulhi+mullo, you can't output two things at
once. If you add the second output, you pay the scaling cost for the
bypass network. Whether that makes sense depends on how often you
think you need both halves. I'll wager it is not that frequently, and
it is better to add another complete multiplier unit so you can do 2*
mullo or 2* mulhi or mullo+mulhi simultaneously for the same # of
output ports (but also 2-3 more inputs). (mullo is used by lots of
code, so you want lots of them)

> Since we're going to have a mullo operation, is it possible to have an
> instruction or circuit which extends the low half to a full product, in
> a way that's cheaper than doing the full product from scratch?

So, there is no mulhi instruction, only a mulextlo2hi instruction? I
wouldn't want that, as you can use mulhi to implement integer division
and in that case, you do not care about mullo. Furthermore, as you
already noted above, building a circuit which can output either of
mullo/mulhi costs twice the area of mullo and +4% latency. So making a
mullo circuit plus a special mulextlo2hi circuit will probably end up
costing more area (with the same depth).

Sadly, in hardware, simple and dumb usually rules clever and complex.

>   H = (x - L) mod (B-1)

So let me get this straight. Your plan is to compute x=a*b mod (B-1).
In other words, you sum 64 columns of 64-rows in a square shape
instead of the usual parallelogram. Then you throw the twos complement
of the old result (L) in there for 65 rows. Finally, apply Wallace and
a KS adder.

> For modulo (B-1), we get the correct result only modulo B(B-1).

I don't agree. I think you get the correct result modulo B-1. Which is
fine, since H will be less than that anyway. You just need to
interpret all 1s as all 0s in the final result.

> But wraparound multiply splits naturally into smaller pieces...

You're overthinking it.

x = H*B + L
Thus this is ALWAYS true: H = x-L (mod B-1)

So you just have to do the full calculation modulo B-1. That means 3:2
reductions in the high column just puts their bit results to columns
63 and 0 instead of 63 and 64. Easy.

> So does this make sense at all?

I think it would certainly work, but I don't see the savings in
hardware. Your new circuit still has 64x64 ANDs to put into the
square. It still has the same number of rows to sum. The only thing
you save is that the final adder is 64-bits instead of 128-bits. That
means a single level of KS addition saved.

What did it cost you?

You now have two independent circuits: mullo and mulhi. The area cost
is thus 1.5* the size of the original mullo/hi mux circuit. You also
can't compute mulhi independently; you must do mullo first. All that
for reducing the circuits by 1 KS row. I'll wager that in most CPUs
the multiply circuit is not the critical path, so the ~4% delay
savings doesn't even increase the clock rate.

Simple and dumb. Sadly, it's nearly always the right choice.

More information about the gmp-devel mailing list