Best way to carry on 2-input architecture?

Fri Aug 22 13:35:12 UTC 2014

On Fri, Aug 22, 2014 at 2:06 PM, Niels Möller <nisse at lysator.liu.se> wrote:
> But a 64-bit mul is usually not single cycle?

Of course not. I'm not sure what you're driving at here.

If mullo and mulhi both have the same clock and are both 3 cycles
deep, it does not matter if one has slightly lower latency between two
registers. A chip is timed so that the slowest path between any two
registers for a single clock domain is guaranteed to complete within
the clock period.

> On typical x86 processors,
> mullo (or low half of a wide mul) is available a cycle earlier than the
> high half, so maybe there are other reasons for that

All I can say for sure is that they didn't HAVE to make it one cycle
later. There are all kinds of reasons they might have done so that
have nothing to do with the latency of a multiplier circuit.

> You could get by with a single port, if you send out the results in
> different cycles. Returning to x86 processors, it's also common that one
> can issue a widening mul only once every second cycle. Maybe that's how
> they do it?

It's possible. We don't have the HDL for an Intel processor, so who knows.
Real chips have lots of complicating factors that need to be balanced.

I am not sure this scheme is a big win, though. Outputting the high
half a cycle later means you cannot issue a new multiply through the
EU during that cycle. So, back to my mux(high,low) multiplier example,
I can output both halves on alternating cycles too, with the same
throughput. That's because my mulhi goes in where they have a pipeline
stall to push out the high half.

If they are doing that, maybe it's to save power or something.

> I think it's fairly unusual (at least for bignum applications) that you
> want the high half but not the low half. Even for division (see
> https://gmplib.org/~tege/division-paper.pdf), the low half is needed
> (and interpreted as quotient fraction for the adjustment step). There
> may be a few cases where a mulhadd could eliminate the need for the low
> half.

When I said integer division, I did not mean bignum integer division.
If there is no hardware divide instruction, you can implement it using
fmahi. In that case you don't need fmalo at all.

My CPU does a 64x64=>128 and uses a mux to pick hi/lo halves. For my
design, I am sure this is the right choice. It keeps the issue stage
very simple and I want to use the circuit for division (software or
microcode). If I was building an ASIC and had different goals, maybe
I'd do something else. I'll wager there are more users of integer
division than there are users of gmp, so the trade-off in my case is
clear. =D

> It's fine for plain mulhi, but for mulhadd, H can equal B-1. Largest
> possible value is x = (B-1)*(B-1) + B-1 = B (B-1).

Fair enough. You could easily detect that earlier in the pipeline by
seeing if your c in a*b+c is all ones.

> mentioned NTT as a possibility. That should be more efficient for
> wraparound mul than for regular mul. But I would also be a bit surprised
> if that can bring any significant savings for sizes as small as 64 bits.

I was not proposing it for 64x64!!!
I have no idea when NTT would be more efficient.
Maybe like 2048x2048 or something.

The gate depth of NTT will be 2*log(N)*C where C is the cost of doing
GF(N) arithmetic in the butterfly stages. Let's guess C is like 8. Oh,
you also need a small Wallace tree and a KS adder to recombine the
final carries too.

So, NTT gate depth always be much worse than a standard multiplier,
except for the wire delays due to the huge size of a really big
standard multiplier as the quadratic AND area explodes.

To be fair, both KS and a butterfly network will also have serious
problems with wire lengths in the far mixing rows. Maybe there's a
better, as yet undiscovered circuit. At the scales we're talking
about, wire delay will be the dominant cost.

FYI, as you didn't know about the KS-adder: It's just a specific
example of the more general prefix-sum design pattern. That's a
standard building block for parallel algorithms (including hardware).
There are lots of variations. KS has nlogn area and logn depth. You
can also do prefix-sum (and thus adders) in n area and 2logn depth.
Lots of trade-off options.