Best way to carry on 2-input architecture?

Sun Aug 17 17:17:06 UTC 2014

On Sun, Aug 17, 2014 at 5:05 PM, Torbjörn Granlund <tg at gmplib.org> wrote:
> My work is available here: https://gmplib.org/~tege/fisa.pdf

Thanks for the link. Currently, I am working on my processor's
architecture and its performance. Thus, I have not yet committed to a
specific ISA. In fact, I have been building it such that it should be
possible to support different ISAs just by swapping out the decoder
stage and setting some VHDL generics. To test it, I've been using a
hackish 16-bit ISA (so I can load 4 ops in one cycle).

Short term, I was planning on supporting the LM32 ISA as my processor
already implements execution units that support all the LM32
operations. This way I could test larger programs without writing a
compiler backend. Longer term I had some novel ideas of how to achieve
a dense ISA encoding I wanted to experiment with. For example,
stack-based with 1 random access argument. Or "fifo"-based with 1
random access argument. Since my processor already renames registers
before issue, I can be quite flexible here.

With that said, you can understand I am quite interested in any
existing ISAs out there which are close enough to my architecture that
I can easily implement them. Aside from the 3-input-operand
instructions, yours look like a pretty nice fit for a potential ISA
frontend.

Is there compiler support for your ISA? Where are the op codes? Is
there a reference implementation I could compare against?

> Of course, encoding 4 separate operands and needing 3 register reads has
> a cost.  An trick for denser coding is to either enforce that e.g. a ==
> d, or that d can just be (say) the low 8 registers.

>From my point-of-view, the problem with 3 operands is not the
encoding. While the encoding matters for code/cache density, the
support for 3 operands in hardware is the killer. When you have U
execution units in an OoO design, they need to have U*U bypasses
between them for each possible operand. In an FPGA this is even worse
as you don't have multiported memory. At the moment I pay 2*U*U memory
blocks to duplicate the register-file between all the execution units.
Increasing to 3 operands is a 50% area and memory increase on the
critical bypass path of the design!

If even one instruction supports 3 operands, I must pay this +50% area
cost. I will try adding a third operand and see what the performance
impact is, but I anticipate a reduction in maximum frequency around
15%. Do 3 operand instructions increase throughput enough to justify
this? In the gmp adder loop I analyzed in my last email, I am not sure
you would actually gain much performance with your "addc" instruction.

What I do like about your ISA is that, having gone 3-operand, you went
all in. FMA for integer+float is great => can skip hardware support of
division. mux and select are great for avoiding branches (and
attendant misprediction risk).

Why did you bother with 2-operand versions of the 3-operand arithmetic?

I noticed you have no smulhi. I was also planning on skipping this,
though it's probably not to costly to make the multiplier unit handle
a sign bit. Is there any use for smulhi at all?

You went with -1 for true, probably to work well with your mux
instruction. Unfortunately, C defines the result of a comparison to be
'1'.  Being forced to turn -1 into 1 everywhere in a C program seems
like a hefty cost. Furthermore, you need to fan-out the carry-out of
the comparison adder to the full register width. For timing
performance, it's better to fanout from an input register's 0-bit than
before you forward the result to other execution units.

To help with new branches I can't yet predict, I was planning on
including a compiler-hint in the opcode indicating if the processor
should take the branch. I would also prefer a branch instruction that
took two operands, so that you can do the comparison in the same
instruction.

> I think a separate carry flag might not be good for modern designs.
> Carry/borrow state is better to keep in a plain register.

Certainly. If I used a flag approach, I would have to rename it to a
full backend register anyway, so using a normal architectural register
is definitely the better way to go.

> And for multiply throughput is more important than low latency

That's lucky. Multiplier throughput is an FPGAs strong suite.

> For multiply, d = (a * b + c) mod B (B being the word base) and d = [(a
> * b + c) / B] are very useful.

I see the benefit to the mulhadd variant (=> no hardware division).
I'm not convinced by the mulladd variant. One cycle saved for one
instruction at the cost of reduced frequency for the whole design.

Again, thanks for the link, it's given me food for thought.