Best way to carry on 2-input architecture?

Wesley W. Terpstra wesley at
Mon Aug 18 01:00:33 UTC 2014

On Sun, Aug 17, 2014 at 10:00 PM, Niels Möller <nisse at> wrote:
> tg at (Torbjörn Granlund) writes:
>> My work is available here:

Thank you for the link.

> And mine at (a toy
> project, trying to squeeze a decent instruction set with 16 general
> purpose registers into 16-bit opcodes).

As I mentioned, my current ISA is also 16-bit. I won't go to the
trouble of writing it up, as I intend to ditch it for something better
as soon as I've finalised the core CPU design. However, I went with
the format OoAB (each 4 bit) where the main op code is O, and
operations are always gp(B) <= gp(A) op gp(B). The format only
supports 8-bit constants via an instruction that sets gp(B) <=
sext("oA"). The restriction of having B as both input and output is
not particularly onerous, IMO. If you have a CPU with register
renaming, executing a copy instruction is free (except for the space
in memory). So if the compiler can't avoid a copy before the
instruction, at least it doesn't add any latency. I'm undecided as to
whether 16 registers are sufficient.
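To make that concrete, here's a rough C model of the encoding. The
exact field positions (O in the top nibble, then o, A, B) are my own
guess for illustration:

```c
#include <stdint.h>

/* Sketch of decoding the 16-bit OoAB format described above; the
 * field layout (O in bits 15..12, o in 11..8, A in 7..4, B in 3..0)
 * is an assumption. Operations are gp(B) <= gp(A) op gp(B). */
typedef struct { unsigned O, o, A, B; } insn;

static insn decode(uint16_t w) {
    insn i = { (w >> 12) & 0xF, (w >> 8) & 0xF, (w >> 4) & 0xF, w & 0xF };
    return i;
}

/* The 8-bit constant form gp(B) <= sext("oA"): the middle two
 * nibbles concatenated and sign-extended to a full word. */
static int32_t imm8(uint16_t w) {
    return (int32_t)(int8_t)((w >> 4) & 0xFF);
}
```

Since B is both a source and the destination, the decoder never needs
a third register specifier, which is what keeps everything in 16 bits.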

The main reason this is only a temporary ISA is that there is no easy
way to get large enough branch displacements. I considered adding a
special "prefix" register as you describe, but that has some serious
problems when built as actual hardware:

The first option is just using the "prefix" register by name. So you
shift in your constant, 12 bits at a time, then execute an instruction
that uses it by name. That works well for arithmetic. The downside to
this scheme is that all your branches are now indirect branches. That
makes the job of the fetch stage very hard. Probably too hard.

The alternative is to use prefix instructions. Here you load some
constant in pseudo-instructions that become no-ops and the next "real"
instruction receives the constant as an immediate in place of gp(B).
The downside to this is that it makes the ISA context sensitive. When
you receive an interrupt, you need to make sure that the return PC
points to the start of the prefix sequence. On the plus side, the
"prefix" instructions can be readily identified during the fetch stage
and thus branch prediction is nearly as easy as in a classic RISC. I
think this is the better of the two options.
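Roughly, a software model of the prefix scheme looks like this. The
12-bit chunk size and the shift-then-or accumulation are assumptions
on my part:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative model of the prefix-instruction scheme: each prefix
 * shifts 12 more bits into a hidden latch; the next non-prefix
 * instruction consumes the latch as its immediate. */
typedef struct {
    uint32_t latch;
    bool     valid;   /* a prefix sequence is in flight */
} prefix_state;

static void prefix(prefix_state *s, uint32_t chunk12) {
    s->latch = (s->latch << 12) | (chunk12 & 0xFFF);
    s->valid = true;
}

/* A "real" instruction takes the latched value if one is pending,
 * otherwise falls back to its own short immediate field. */
static uint32_t take_immediate(prefix_state *s, uint32_t short_imm) {
    uint32_t imm = s->valid ? s->latch : short_imm;
    s->latch = 0;
    s->valid = false;
    return imm;
}
```

The context sensitivity is visible in the `valid` flag: if an
interrupt lands mid-sequence, the return PC has to be rewound to the
first prefix so the latch gets rebuilt on return.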

Your approach is closest to the first option. Because you need to look
at the "prefix flag" and combine the "prefix register" with the
immediate value when you do a jump, you have the indirect-branch
problem... on steroids.

> I'm not familiar with implementing cpu's (I'd like to learn some day),
> but I imagine that additional read/write ports to a couple of flag bits
> are cheaper than additional read/write ports to the much wider register
> file.

That's definitely true for an in-order CPU. If you really want flags,
you can have them on an out-of-order CPU too, though it does imply a
3rd port. And just because the flag register file is narrow doesn't
mean that all the attendant circuitry to rename the flags and delay
instructions until needed flags become ready goes away.

> Or extending that a bit, one needs an efficient way to compute a*b + c +
> d, where the inputs are single-word values, and the result always fits
> in two words, with no overflow.

Hmm. How often does one need this "double" FMA ability? I assume we're
talking about the high part of the result, because the low part could
be obtained with normal arithmetic.
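For the record, it's easy to check that the full a*b + c + d result
always fits in two words with no overflow. A quick C check at 32-bit
word size:

```c
#include <stdint.h>

/* Worst case of a*b + c + d with 32-bit single-word inputs, evaluated
 * in 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, i.e. the result
 * exactly fills two words and can never overflow. */
static uint64_t umaal_worst_case(void) {
    uint64_t m = UINT32_MAX;   /* 2^32 - 1, the largest single word */
    return m * m + m + m;      /* a*b + c + d with all inputs maximal */
}
```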

> E.g., on ARM there's the "umaal"
> instruction; to be really useful, it has to be implemented so that it
> can start the multiply as soon as a and b are available, with the data
> dependency on c and d happening later on in the pipeline.

That's not how you do it. The reason FMA is so great in hardware is
that when you implement say a 32x32 multiplier you first do some
really cheap addition (no carry needed) using a Wallace tree or
something similar. Then after the Wallace tree you have two 64-bit
numbers that you combine with a proper adder to get the final result.
An FMA feeds the extra addend into the calculation before the Wallace
tree, so you essentially get the carries for free. That means you must
have a, b, and c all ready to go before the FMA starts.
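The "really cheap addition" is carry-save: a 3:2 compressor turns
three addends into two with no carry propagation at all, which is why
folding an extra term in before the final adder is nearly free. A toy
C model of one compressor:

```c
#include <stdint.h>

/* A 3:2 carry-save compressor: reduces three addends to a sum word
 * and a carry word, using only per-bit logic (no carry chain). The
 * invariant is a + b + c == *sum + *carry. In hardware the carry
 * word is one bit wider; in this 64-bit model it would wrap if the
 * top bit carried out, so keep the inputs small. */
static void csa(uint64_t a, uint64_t b, uint64_t c,
                uint64_t *sum, uint64_t *carry) {
    *sum   = a ^ b ^ c;                           /* per-bit sum */
    *carry = ((a & b) | (a & c) | (b & c)) << 1;  /* per-bit carries */
}
```

A multiplier's Wallace tree is just a pile of these; only the very
last sum + carry pair goes through a real carry-propagating adder.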

FMA is really great. I really want it. I want it so bad it tempts me
to implement 3 ports just to have it. However, 3 ports cause me so
much pain. :-( I've even contemplated crazy ideas like a special FMA
two-part instruction sequence:

fma-pt1 t, a, b
fma-pt2 y, t, c

The idea being that 't' is used to force sequencing of the
instructions and then you make the bypass between multiply units
double-width so they can stash both a and b into t to pass to the unit
that gets fma-pt2. But this is crazy talk. I am sure this sort of hack
would turn out to be a bad idea longer term. For one thing, it's
another context-sensitive instruction...

> A bit more specialized, if you don't have a divide instruction, you want
> to be able to do division via multiplication.

Yeah, I'd go with Goldschmidt followed by one Newton iteration, both
using FMA. Whether I microcode it or just make sure the right
instructions are available for very good inlining doesn't matter.

> And a count-leading-zeros instruction

Already have it. ;)

More information about the gmp-devel mailing list