Best way to carry on 2-input architecture?

Niels Möller nisse at
Mon Aug 18 07:01:31 UTC 2014

"Wesley W. Terpstra" <wesley at> writes:

> The restriction of having B as both input and output is
> not particularly onerous, IMO.

I agree. If it allows instructions to be squeezed into 16-bit opcodes
rather than 32-bit, then I'd expect that to more than make up for the
occasional extra mov instruction needed.

> I'm undecided as to if 16 registers are sufficient.

In my experience (and I've primarily written performance critical code
for bignum and crypto things), 16 registers is usually sufficient.

> Your approach is closest to the first option. Because you need to look
> at the "prefix flag" and combine the "prefix register" with the
> immediate value when you do a jump, you have the indirection branch
> problem... on steroids.

I don't think it has to be that bad. First, the prefix flag and register
should be saved and restored on interrupt, so there should be no problem
with IRQs or page faults and the like in the "middle" of an instruction.

And then I intend the prefix register and prefix flag to live with the
decoder block (so it should never be subject to register renaming). The
prefix instruction is "executed" as part of the decoding, and the full
immediate value/offset is passed with the rest of the decoded
instruction to the main instruction execution block.

I imagine one has to do something a bit similar with the program
counter, to support instructions with the pc as a source operand: The pc
lives in the decoder block, not in the regular register file, and the
right value (i.e., address of the next instruction) is passed on with
those decoded instructions which need it. Is that right? And then
additional hooks are needed for instructions with the pc as destination

>> Or extending that a bit, one needs an efficient way to compute a*b + c +
>> d, where the inputs are single-word values, and the result always fits
>> in two words, with no overflow.
> Hmm. How often does one need this "double" FMA ability? I assume we're
> talking about the high part of the result, because the low part could
> be obtained with normal arithmetic.

On ARM, it is used by addmul_1, addmul_2, addmul_3 and friends, which
means that it is the main workhorse of bignum multiply. Search for it
in the mpn/arm subdirectories for examples. And I think you're right
that it's the high half which is critical, even though the ARM instruction
produces both halves.

>> E.g., on ARM there's the "umaal"
>> instruction; to be really useful, it has to be implemented so that it
>> can start the multiply as soon as a and b are available, with the data
>> dependency on c and d happening later on in the pipeline.
> That's not how you do it. The reason FMA is so great in hardware is
> that when you implement say a 32x32 multiplier you first do some
> really cheap addition (no carry needed) using a Wallace tree or
> something similar. Then after the Wallace tree you have two 64-bit
> numbers that you combine with a proper adder to get the final result.
> An FMA puts the extra term to sum into the calculation before the
> Wallace tree, so you essentially get free carries. That means you must
> have a*b+c all ready to go before the FMA.

But if it is done that way, it's almost useless for gmp. One of the add
operands is the input carry word, and if it has to be available before
the multiply, then you get the full multiplication latency on the
critical path. If I remember previous discussions on this list
correctly, experiments showed that at least on Cortex-A15, umaal is
*not* implemented that way: The latency between the availability of the
add operands and availability of the output is only a single cycle.


Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
