Best way to carry on 2-input architecture?

Thu Aug 21 21:12:17 UTC 2014

nisse at lysator.liu.se (Niels Möller) writes:

> "Wesley W. Terpstra" <wesley at terpstra.ca> writes:
>
>>> But maybe you can do a circuit for mulhi which is significantly simpler
>>> than for a widening multiplication, because you only need to collect the
>>> carry out from the low partial products? Something like a Wallace tree
>>> where you drop the low output bit from all the half and full adders of
>>> the low product half.
>>
>> If you figure out a way to do this, please let me know.
>>
>> Personally, I don't see how this could be done. The Wallace tree can
>> never reduce you to less than two rows. So, you still have to do the
>> full adder to find the carry for the lower half.
>
> Hmm. I'd have to work out a small example to see if I can make sense of
> this.

Let me think a bit more aloud on mulhi vs full multiplication. I think
you're right above. One *can* reduce a piece at the right end to a
single row, but (1) that saves very little of the total work, and (2) in
case it is a saving, it's likely a saving also for a full
multiplications, since having a single row at the right end means one
can use a shorter adder for the final addition.

And in the final Kogge-Stone adder, for mulhi one can skip all the gates
needed to compute the low sum bits, and only compute the combined P,G
for the low half. But as far as I understand, that would reduce only
gate count, not latency. So this means that a mulhi requires practically
almost the work for a full multiplication. On the other hand, mullo
should be significantly cheaper, with half the number of bit products,
all carries from the leftmost column in the Wallace tree ignored, and a
final adder of half the length.

So what should one do if one have separate mullo and mulhi instructions
and no widening multiply? I guess the common way is to have a circuit
for a full widening multiplier, where mulhi simply discards the low
half. And one could either use the same circuit for mullo (but getting
the result a bit earlier), or a separate mullo circuit in some other
execution unit. If one uses the same circuit for both mulhi and mullo,
maybe one can arrange in the the decoder so that a mullo instruction,
immediately after a mulhi, and with the same operands, can piggyback on
the mulhi instruction.

Since we're going to have a mullo operation, is it possible to have an
instruction or circuit which extends the low half to a full product, in
a way that's cheaper than doing the full product from scratch?

If B = 2^k is the bignum base, and we have some arbitrary two-word
number split in high and low words as x = L + B H, then by the chinese
remainder theorem, we have

  H = (L - x) mod (B+1)

or, if we also have H <= B-2,

  H = (x - L) mod (B-1)  

Apply this with x being a*b, or a*b+c, or a*b+c+d. Then if one first
computes the low half, L = a*b + ... (mod 2^k), one can compute the high half
using a computation (a*b + ...) mod (B±1).

Both variants have their corner cases. For modulo (B-1), we get the
correct result only modulo B(B-1). So we can't quite compute the mulhadd
a*b+c, since the border cases a = b = c = B-1 and a = b = c = 0 give the
same result. Maybe one can workaround by handling one of these extremes
as a special case.

For modulo (B+1), the value B = -1 (mod B+1), can't be represented as a
single word. An extra bit should be no big problem for internal
operations in the cpu, and also no problem for final results where we
know a-priori that H < B. But it causes trouble if one tries to store
temporaries in regular word-size registers.

So does this make sense at all? It should be straightforward to
construct a Wallace tree for (a*b) mod (B-1), and one gets a square
shape with circular propagation of carries. For mod (B+1), one gets the
complication that bits switch sign at wraparound. It's not clear if it's
going to be more efficient than a full multiplication.

But wraparound multiply splits naturally into smaller pieces, either
using some small instantiation Schönhage-Strassen style FFT multiply,
where the roots of unity are hardware-friendly rotates. Or using the
factorization (2^k-1) = (2^{k/2} -1) (2^{k/2}+1).

*If* one can find a way to compute either wraparound product efficiently
in hardware, next question is, should one only use it only internally,
or provide instructions for wraparound arithmetic?

If one likes puzzles, maybe one can look what happens if one constructs
a Wallace tree for mullo, and a Wallace tree for modulo (B-1), and then
try to share full adders for the common pieces.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.