Best way to carry on 2-input architecture?

Wesley W. Terpstra wesley at terpstra.ca
Sat Aug 16 21:35:52 UTC 2014


Good evening,

As a hobby project I've been building an out-of-order superscalar CPU
for use in FPGAs. Everything has been going pretty smoothly, but I'd
like to be able to support gmplib efficiently on it. To keep the issue
stage small+fast, instructions on this machine can only accept two
register arguments at once. No carry flags. I was wondering if there
is a clever way to implement bigint addition with less
instructions/word, by adding a custom instruction or two.

So I can already do the generic gmp version:
  do
    {
      ul = *up++;
      vl = *vp++;
      sl = ul + vl;
      cy1 = sl < ul;
      rl = sl + cy;
      cy2 = rl < sl;
      cy = cy1 | cy2;
      *rp++ = rl;
    }
  while (--n != 0);

In my assembly language this would best be written as:
loop:
  iload ul, up, i
  iload vl, vp, i
  add sl, ul, vl
  addhi cy1, ul, vl ; cuts dependency on sl => gives carry immediately
  add rl, sl, cy
  addhi cy2, sl, cy
  or cy, cy1, cy2
  istore rl, rp, i
  add i, i, 1
  bne i, n, loop
... while that looks awful, it's actually not so bad. Between the
loads and store are only two instructions (2* add), the minimum
possible. Between loop iterations, the critical path path is cy, which
goes through an addhi and an or. I can't yet reorder loads before
stores, but in the future once I can, this means each loop iteration
has latency of 2 cycles (assuming the reorder buffer is large enough).
Since the loop has 10 instructions and I can only issue 4 at once, I
am actually bottlenecked by the poor code density and L1-cache
interface. Hmm. Now I wonder why I am writing this email and perhaps
there is nothing that needs to be done!

For comparison, if I had addc:
loop:
  iload ul, up, i
  iload vl, vp, i
  addc sl, ul, vl ; can not implement this! (3-inputs: CF, ul, vl)
  istore sl, rp, i
  add i, i, 1
  bne i, n, loop
... that is only 6 instructions and the latency is just 1 cycle.
However, I'm still going to be bottlenecked by the L1 and issue-width.

So, now that I've more carefully analyzed the loop, I am actually not
as concerned as I was initially! However, I think the question still
stands: aside from an "addhi" instruction, are there any other custom
instructions I should be adding for bigint arithmetic? I also already
have "umulhi" which should help with multiplication. I am guessing
that when optimizing gmp, there were times when you guys were
irritated by the lack of some specific instruction that would have
made things a lot quicker. This is the sort of information I'd love to
have.

Thanks for your time!


More information about the gmp-devel mailing list