Carry propagation (was: Re: Risc V greatly underperforms)

Tue Oct 5 06:15:04 UTC 2021

Torbjörn Granlund <tg at gmplib.org> writes:

> This all means that on Risc V, multi-word subtraction could be made to
> at 2 cycles/word while multi-word addition is limited to 3 cycles/word,
> in both cases assuming a very wide super-scalar core.  Remember that
> other concurrent CPUs do these in 1+epsilon cycles/word, and that
> without needing to do wide super-scalar dispatch.

For machines that are *very* carry propagation challenged, one could
consider doing carry propagation as a tree with much less than n
propagation latencies (like in a Kogge-Stone or Brent-Kung hardware
adder).

Step 1: Compute, independently for each limb pair a, b:

s <-- a + b
g <-- s < a       "generate": there's always carry out
p <-- (s == B-1)  "propagate": carry in implies carry out

p and g are two bits, but representing 3 values, since when g is 1, p is
always 0.

Step 2: Form a tree of depth log(n) combining the p and g bits of adjacent
segments of limbs, (g_hi, p_hi), (g_lo, p_lo) as

  g = g_hi | (p_hi & g_lo)
  p = p_hi & p_lo

The tree can produce the final carry bits at all limb position. Or we
can chose to compute them for some subset (e.g, final carry at every 4th or
every 8th limb), or we could limit the depth of the tree and compute g,p
bits for each segment of, say, 32 limbs.

Step 3: Apply the carries, and if we didn't compute them all, do
remaining propagation (which will have a much shorter depth than doing
it for all limbs).

Seems quite unlikely to be a win in software, but might be useful if
it's possible to compute lots of (s, g, p) items in a single cycle,
using very wide SIMD registers and/or very wide super-scalar
architecture.

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.