Best way to carry on 2-input architecture?

Fri Aug 22 12:06:51 UTC 2014

"Wesley W. Terpstra" <wesley at terpstra.ca> writes:

> Nah. You use the same clock for the whole ALU, so your clock rate is
> limited by the same critical path for both circuits. Thus, mullo is
> actually no faster, despite being potentially two gates shorter.

But a 64-bit mul is usually not single cycle? On typical x86 processors,
mullo (or low half of a wide mul) is available a cycle earlier than the
high half (see Torbjörn's https://gmplib.org/~tege/x86-timing.pdf).
You're saying that the difference in circuit delay is likely a lot
smaller than a cycle, so maybe there are other reasons for that (see
below).

> You forget that the EU which contains the multiplier only has a single
> output port to the bypass network. So even if the issue stage somehow
> found a matched pair of mulhi+mullo, you can't output two things at
> once.

Hmm. You could get by with a single port, if you send out the results in
different cycles. Returning to x86 processors, it's also common that one
can issue a widening mul only once every second cycle. Maybe that's how
they do it? Low and high half available in the same cycle, but high half
buffered for an extra cycle before sent to the output port?

And letting the high half be a cycle late is no big cost to bignum
applications, since you typically add the product to something, and then
you get a carry dependency between the operations on the low and high
half anyway.

> If you add the second output, you pay the scaling cost for the
> bypass network. Whether that makes sense depends on how often you
> think you need both halves.

I think it's fairly unusual (at least for bignum applications) that you
want the high half but not the low half. Even for division (see
https://gmplib.org/~tege/division-paper.pdf), the low half is needed
(and interpreted as quotient fraction for the adjustment step). There
may be a few cases where a mulhadd could eliminate the need for the low
half.

>> For modulo (B-1), we get the correct result only modulo B(B-1).
>
> I don't agree. I think you get the correct result modulo B-1. Which is
> fine, since H will be less than that anyway. You just need to
> interpret all 1s as all 0s in the final result.

It's fine for plain mulhi, but for mulhadd, H can equal B-1. Largest
possible value is x = (B-1)*(B-1) + B-1 = B (B-1).

> I think it would certainly work, but I don't see the savings in
> hardware. [...] The only thing you save is that the final adder is
> 64-bits instead of 128-bits. That means a single level of KS addition
> saved.

I see. One would need something more efficient than a Wallace tree for
the wraparound mul, to be able to make any significant savings. You
mentioned NTT as a possibility. That should be more efficient for
wraparound mul than for regular mul. But I would also be a bit surprised
if that can bring any significant savings for sizes as small as 64 bits.

> Simple and dumb. Sadly, it's nearly always the right choice.

With some exceptions... E.g., the Kogge-Stone adder is far from dumb,
and a few days ago I had no idea it existed ;-)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.