Best way to carry on 2-input architecture?

Wesley W. Terpstra wesley at
Mon Aug 18 14:34:58 UTC 2014

So, yeah, we're drifting pretty far off-topic now. If you have
follow-up questions that you think aren't interesting for gmp-devel,
just send them to me privately.

On Mon, Aug 18, 2014 at 2:35 PM, Niels Möller <nisse at> wrote:
> I think the decoder could implement the prefix instruction, as I've
> defined it, in that way, treating a sequence of prefix instructions +
> non-prefix instruction as an indivisible longer instruction. Supervisor
> mode/kernel mode code might need to know if there really is a prefix
> register or not, but otherwise, it's an implementation detail not
> visible to user code.

I don't agree. The point of your approach is that the shift
instruction is NOT a prefix, but a "proper" instruction. Thus, other
instructions can come between it and the instruction which finally
consumes the shift flag. So you can't treat the "shift op + consumer
op" as an indivisible whole, the way you can with variable-length
instructions. This is the key difference.

Either you say it's a prefix op (and thus indivisible from the
consuming op, essentially a variable-length op) or you let
"shift-in-a-constant" be a proper instruction. You chose the latter.
Thus, you need to track it all the way through the pipeline so that
you can abort it.

> If you get a page fault from a reordered load or store, or some
> other exception associated with the execution of a particular
> instruction, how do you stop the instruction flow at the correct point
> before the control transfer to the handler?

So your outline is more-or-less right. There are different approaches
to this problem, depending on your goals. My goals are: exceptions can
be slow, cancelled reads are allowed to be visible outside of the CPU,
cancelled writes are not. Thus, I forbid a write from even being
reordered before an unconfirmed read or branch. This is not
particularly difficult since I only have a single load/store unit due
to resource constraints and I have to watch for RAW conflicts through
memory anyway.
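
As a rough illustration of that ordering rule, here is a minimal sketch in
Python. The function name and the (kind, confirmed) representation are
hypothetical, and "confirmed" is assumed to mean that a read is known not to
fault or that a branch has been resolved. It models only the policy, not the
actual hardware.

```python
def store_may_issue(older_ops):
    """Decide whether a store may be sent to memory.

    older_ops: (kind, confirmed) pairs for every instruction that is older
    than the store and still in flight. Both the representation and the
    meaning of "confirmed" are assumptions made for this sketch.
    """
    # A cancelled write must never become visible outside the CPU, so the
    # store waits until no older read or branch could still kill it.
    return all(confirmed
               for kind, confirmed in older_ops
               if kind in ("load", "branch"))
```

Reads are not subject to the same check, since a cancelled read is allowed
to be visible outside the CPU.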

To cancel the other operations, I have two register renaming maps
(from architectural registers to backing registers). One is at the
front of the pipeline, near the decode stage. The other is at the end
of the pipeline at the retire stage. When an instruction enters the
OoO window, I update the front map. When an instruction is retired
(leaves the OoO window), I update the back map. Instructions only
leave the OoO window, in order, when they've completed. Obviously, I
can retire and load multiple instructions per cycle.

If an exception occurs, I just tag the instruction as killed. When it
reaches the retire stage, I destroy everything in the OoO window and
force the rename stage to load the retire stage's rename map. This
stops processing of any remaining instructions and reverts the writes
of any that were already done. I use this scheme for branch
misprediction too. In all cases, I then supply the PC that was
associated with the killed instruction to the fetch unit. This is
where I would need to resupply the "shift register" state in your ISA
as well.
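
The two-map scheme above can be sketched as a toy software model. This is
only an illustration under stated assumptions: the class and method names,
the dictionary fields, and the free list of backing registers are all
hypothetical and are not taken from the actual design. It handles one
instruction at a time and ignores source operands entirely.

```python
class RenameModel:
    """Toy model of a front (decode) and back (retire) rename map."""

    def __init__(self, num_arch, num_backing):
        # Assumption: architectural register i starts in backing register i.
        self.front = list(range(num_arch))
        self.back = list(range(num_arch))
        # Assumption: unused backing registers are kept on a free list.
        self.free = list(range(num_arch, num_backing))
        self.window = []  # in-flight instructions, oldest first

    def issue(self, pc, dest):
        """Instruction enters the OoO window: update the front map."""
        backing = self.free.pop(0)
        self.front[dest] = backing
        entry = {"pc": pc, "dest": dest, "backing": backing,
                 "done": False, "killed": False}
        self.window.append(entry)
        return entry

    def retire(self):
        """Retire in order. Returns the PC to refetch from after a kill,
        or None if no killed instruction reached the retire stage."""
        while self.window and (self.window[0]["done"]
                               or self.window[0]["killed"]):
            head = self.window[0]
            if head["killed"]:
                return self.flush(head["pc"])
            self.window.pop(0)
            # The previously retired mapping is dead and can be reused.
            self.free.append(self.back[head["dest"]])
            self.back[head["dest"]] = head["backing"]
        return None

    def flush(self, pc):
        """Destroy the window and reload the front map from the back map."""
        for entry in self.window:
            self.free.append(entry["backing"])
        self.window.clear()
        self.front = list(self.back)
        return pc
```

In this model a killed instruction does nothing until it becomes the oldest
entry in the window. At that point every younger instruction is discarded
and the front map is overwritten with the back map, which undoes their
register writes without any per-instruction rollback.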

It sounds expensive, but after the CPU has been running a bit without
exceptions, you generally have stuff waiting at the bottom of the OoO
window from whatever the critical path through the code is anyway.
There are other approaches that are faster. If it turns out to be too
slow for misprediction, I may revise my plan. Mine is very simple,
though, which makes it good for a first version on an FPGA.
