# Best way to carry on 2-input architecture?

Niels Möller nisse at lysator.liu.se
Wed Aug 20 08:59:35 UTC 2014

```"Wesley W. Terpstra" <wesley at terpstra.ca> writes:

> On Sun, Aug 17, 2014 at 5:05 PM, Torbjörn Granlund <tg at gmplib.org> wrote:
>> For multiply, d = (a * b + c) mod B (B being the word base) and d = [(a
>> * b + c) / B] are very useful.
>
> I see the benefit to the mulhadd variant (=> no hardware division).
> I'm not convinced by the mulladd variant.

I have a couple of questions on hardware multiply, which I think aren't
too off-topic on the list. I've had a look at what Wikipedia says about
Wallace trees, thanks for the hint on that.

First question is on mulhi vs full widening multiplication.

I've been thinking that if you want the full two-word result of a
multiply, and compute each part separately using mulhi and mullo, the
mulhi operation will have to compute the full product anyway, discarding
the low half, and then the mullo will redo the same computation. This
seems redundant, which makes me think that an instruction producing the
complete product would use the hardware more efficiently (assuming one
has the opcode space and other infrastructure needed to support such an
instruction).

But maybe you can do a circuit for mulhi which is significantly simpler
than for a widening multiplication, because you only need to collect the
carry out from the low partial products? Something like a Wallace tree
where you drop the low output bit from all the half and full adders of
the low product half.

And maybe you would want to implement some sort of carry-lookahead
anyway, to reduce latency, so a even a widening multiplication would use
mostly independent circuits for the low and high half?

If I drop the widening multiply from my instruction set, I could reuse
the opcode space for a mulhadd.

Second question is on how large multipliers are implemented. How large
multipliers is it practical to build using a single cycle combinatorial
Wallace tree? How does one organize a 64x64 multiplier, which seems too
large to do in one cycle?

One could build it out of smaller blocks, say we have combinatorial 8x8
multipliers. We then need 64 of those, in sequence or parallel, and a
bunch of adders. Or we could use Karatsuba, to do a 64x64 using 27 8x8
multiplies and a bunch of additions.

So one possibility might be a three-stage pipe-line, where the first
stage exands the two 64-bit inputs into 27 8-bit values each (well, they
grow a little; we either have to manage additional sign bits, or to work
with unsigned only, the largest pieces will grow to 11 bits). Second
stage would do 27 smaller multiplies in parallel. And the third and
final stage would combine those 27 products to the complete 128 bit
product (not quite sure how it would look like if we want the high half
only).

Does that make any sense?

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
```