Best way to carry on 2-input architecture?

Mon Aug 18 11:36:44 UTC 2014

On Mon, Aug 18, 2014 at 9:01 AM, Niels Möller <nisse at lysator.liu.se> wrote:
> I don't think it has to be that bad. First, the prefix flag and register
> should be saved and restored on irq, so there should be no problem with
> irq:s or page faults and the like in the "middle" of an instruction.

Yes, that's the advantage. Keep in mind, though, that dealing with
variable-length instructions is a well understood and not-so-difficult
problem. I just need to report the PC as being at the start of the
prefix-chain. This change is local to the decoder.
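
A minimal sketch of that decoder behaviour, assuming a made-up encoding: the tuple format, the 12-bit prefix payload and the 4-bit base immediate are all hypothetical and only illustrate reporting the PC at the start of the prefix chain.

```python
# Toy decoder: "prefix" words extend the immediate of the next real instruction.
# The encoding here is hypothetical, chosen only to keep the example short.

PREFIX_BITS = 12  # assumed payload width of one prefix word
IMM_BITS = 4      # assumed immediate width inside a normal instruction


def decode(words):
    """Return a list of (pc, opcode, immediate), one per complete instruction.

    pc is the address of the first word of the instruction, i.e. the start
    of its prefix chain, so a fault can restart from that address.
    """
    decoded = []
    shift_reg = 0        # accumulated prefix payload
    chain_start = None   # address of the first word of the current instruction
    for addr, word in enumerate(words):
        if chain_start is None:
            chain_start = addr
        if word[0] == "prefix":
            shift_reg = (shift_reg << PREFIX_BITS) | word[1]
            continue
        opcode, imm = word
        decoded.append((chain_start, opcode, (shift_reg << IMM_BITS) | imm))
        shift_reg = 0
        chain_start = None
    return decoded


program = [
    ("add", 3),
    ("prefix", 0xABC),
    ("prefix", 0x123),
    ("load", 0x5),
    ("sub", 1),
]
result = decode(program)
```

Here the "load" is reported at address 1, the first prefix word, rather than at address 3, so nothing outside the decoder needs to know a prefix chain existed.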

> one has to do something a bit similar with the program counter

I agree with everything you said about the PC.
Dealing with the PC is a real pain.
Not something to copy unless you must.

> And then I intend the prefix register and prefix flag to live with the
> decoder block (so it should never be subject to register renaming). The
> prefix instruction is "executed" as part of the decoding, and the full
> immediate value/offset is passed with the rest of the decoded
> instruction to the main instruction execution block.

The value in the decoder is ahead of the values seen in the execution
units. If an exception occurs, you need to be able to rewind/reset the
value in the decoder to the state it would have had if execution had
gone to the correct destination at that point. Upon reflection, I
think you are correct that I could track the state of the shift
register all the way through the pipeline, just like the PC. Then,
when I retire an exception-producing instruction, I feed the
fetcher/decoder the old shift register state at the same time I give
it the new PC.
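
As a rough sketch of that idea (not a description of any real core), each in-flight slot can carry the decoder's shift register value from just before it was decoded, alongside its PC. The class names, the tuple encoding and the 12-bit prefix width are assumptions made for the example.

```python
from dataclasses import dataclass

PREFIX_BITS = 12  # hypothetical payload width of a prefix word


@dataclass(frozen=True)
class Slot:
    pc: int
    word: tuple
    shift_in: int  # decoder shift register *before* this word was decoded


class FrontEnd:
    """Fetcher/decoder that runs ahead of the execution units."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.shift_reg = 0

    def fetch_decode(self):
        word = self.program[self.pc]
        slot = Slot(self.pc, word, self.shift_reg)
        if word[0] == "prefix":
            self.shift_reg = (self.shift_reg << PREFIX_BITS) | word[1]
        else:
            self.shift_reg = 0  # a real instruction consumes the prefix
        self.pc += 1
        return slot

    def redirect(self, pc, shift_reg):
        # On an exception, restore both pieces of state together.
        self.pc = pc
        self.shift_reg = shift_reg


program = [("prefix", 0xABC), ("load", 5), ("prefix", 0x123), ("add", 1)]
fe = FrontEnd(program)
in_flight = [fe.fetch_decode() for _ in range(4)]  # decoder has run ahead
faulting = in_flight[1]                            # suppose "load" faults at retirement
fe.redirect(faulting.pc, faulting.shift_in)
replayed = fe.fetch_decode()
```

After the redirect the front end sits at the "load" with 0xABC back in the shift register, so replaying it produces the same slot as the first time. The cost is the extra shift register field carried by every in-flight instruction.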

So, I agree it could be done. But, I still think it's more expensive
than supporting variable-length instructions. Going down the
variable-length instruction path also opens the door to other
features, like your 4-operand umaal instruction.

> But if it is done that way, it's almost useless for gmp. One of the add
> operands is the input carry word, and if it has to be available before
> the multiply, then you get the full multiplication latency on the
> critical path. If I remember previous discussions on this list
> correctly, experiments showed that at least on cortex-a15, umaal is
> *not* implemented that way: The latency between the availability of the
> add operands and availability of the output is only a single cycle.

That's very strange. So, let me preface my response by noting that I
know very little about ARM and that all of my practical HDL experience
is with FPGAs, not ASICs. Furthermore, while I've had to debug a few
soft CPUs, I build different circuits professionally. This CPU project
is something I'm doing for fun. So I am definitely not an expert on
how ARM implemented this.

That said, from what you describe, it sounds to me like they've
actually decomposed the FMA into two micro-ops. The first does the
32x32 a*b and the second runs when the 64-bit multiply result and the
32-bit c term are available. Otherwise, their instruction scheduler
would somehow have to be able to look into the future to see that it
has a and b ready now and will have c later. That is probably possible,
but I'll wager it's prohibitively expensive, likely doubling the
length of the scheduling critical path. If the FMA is decomposed into
two micro-ops, a bog standard instruction scheduler would have the
behaviour you describe. This approach would also mean that they could
use a 2-port processor core to implement FMA at the cost of a
double-wide bus for one of the operands.
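
To make the guess concrete, here is a small model of the two-micro-op split. The umaal semantics (RdHi:RdLo = Rn*Rm + RdLo + RdHi on 32-bit unsigned values) are the architectural definition; the split itself and the names uop_mul and uop_add are assumptions for illustration, not a claim about how the Cortex-A15 is actually built.

```python
MASK32 = (1 << 32) - 1


def uop_mul(a, b):
    """First micro-op: needs only the two multiplicands; yields a 64-bit product."""
    return (a & MASK32) * (b & MASK32)


def uop_add(product, c, d):
    """Second micro-op: needs the 64-bit product plus the two 32-bit addends.

    Returns (lo, hi), the two 32-bit halves of the 64-bit sum.
    """
    total = product + (c & MASK32) + (d & MASK32)
    return total & MASK32, total >> 32


def umaal(a, b, c, d):
    return uop_add(uop_mul(a, b), c, d)


# Worst case: (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1, so the sum always fits.
worst = umaal(MASK32, MASK32, MASK32, MASK32)
small = umaal(3, 4, 5, 6)
```

With this split, an ordinary scheduler can issue uop_mul as soon as a and b are ready, and only uop_add waits for the addends. That would match the reported single-cycle latency from the add operands to the result.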

This has given me something to think about, as it might be a way for
my implementation to also get FMA with 2 ports. Frankly, though, I am
more tempted to go with a fully 3-port design like Torbjörn's than
building in some hack.

> On ARM, it is used by addmul_1, addmul_2, addmul_3 and friends, which
> means that it is the main work horse of bignum multiply. Search for it
> in the mpn/arm subdirectories for examples.

I agree the instruction looks interesting. From the fma behaviour you
described above, I strongly suspect that ARM has one of their register
file ports 64-bit wide. That would let you implement all of these
sorts of operations with micro-ops. Again, though, I know almost
nothing about ARM.
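
For reference, the way umaal carries an addmul_1-style loop can be sketched as below. This is a plain model with 32-bit limbs stored least significant first, not GMP's actual assembly. The helper to_int is hypothetical and exists only to check the result.

```python
MASK32 = (1 << 32) - 1


def umaal(a, b, c, d):
    """Model of umaal: returns (lo, hi) of a*b + c + d for 32-bit inputs."""
    total = a * b + c + d
    return total & MASK32, total >> 32


def addmul_1(rp, up, v):
    """rp[0:n] += up[0:n] * v, in place; returns the final carry limb."""
    carry = 0
    for i in range(len(up)):
        # The carry is the loop-carried dependency: it is one of the addends.
        rp[i], carry = umaal(up[i], v, rp[i], carry)
    return carry


def to_int(limbs):
    # Hypothetical helper: little-endian limb list to a Python integer.
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


rp = [0xFFFFFFFF, 0x12345678, 0xFFFFFFFF]
up = [0xFFFFFFFF, 0xDEADBEEF, 0xFFFFFFFF]
v = 0xFFFFFFFF
expected = to_int(rp) + to_int(up) * v
carry = addmul_1(rp, up, v)
```

The multiplies up[i] * v do not depend on the carry, so only the final add sits on the chain from one limb to the next. This is why the latency from the add operands to the result matters so much here.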

PS. Sorry you got this twice, Niels. I missed the reply-all button.