Multi-limb inverse for mpn_divrem_1

Tue Sep 16 22:44:08 CEST 2003

Kevin Ryde <user42 at zip.com.au> writes:

  Torbjorn Granlund <tege at swox.com> writes:
  >
  > The problem with division by a single-limb divisor is that the
  > limb multiply instructions lie on the recurrency path.  Pipelined
  > multiply units are largely wasted.

  I wonder if it'd be possible to make the trial quotient correct most
  of the time.  That way the addback could be taken off the dependent
  chain, or rather it would become dependent only occasionally.

It would be possible to get the quotient almost always right, by
including more bits to the reciprocal.

I can see two approaches here, include a bit more or include a
full limb more.  By including just a bit more, one would
conditionally add the partial remainder to the preliminary
quotient.  The quotient would still sometimes be too small, but
not at all as often as it is now.

In practice one would use separate loops for when the extra bit
is 0 and for when it is 1.

Including a full limb more would of course make the quotient almost
always correct.  But the cost is probably too high.

  The idea would be to kick off the next iteration in parallel with
  verifying the correctness of the present quotient limb (or limbs).

  Haven't thought this through, it might be too messy, or might be too
  costly to improve the quotient.

As you said in your follow-up message, the next preliminary
quotient is dependent on the previous partial remainder, which in
turn depends on the previous quotient.

But one doesn't need to wait for the partial remainder to
actually start the multiplying for the next quotient, since only
*one* limb of the partial remainder comes from the previous
iteration, and the remaining ones come from the dividend!

This is something a realized just the other day.

This shows the situation for a 4-limb recurrency:

                             _____ _____ _____ _____ 
 dividend/partial remainder |_u3__|_u2__|_u1__|_rem_|
     precomputed reciprocal |_r3__|_r2__|_r1__|_r0__|
                            .
                            .            _____ _____
  partial products:         .      _____|XXXXX|XXXXX|
                            ._____|XXXXX|XXXXX|
                       _____|_____|/////|
                      |_____|_____|
                            .      _____ _____
                            ._____|XXXXX|XXXXX|
                       _____|_____|/////|
                 _____|_____|_____|
                |_____|_____|
                            ._____ _____
                       _____|_____|/////|
                 _____|_____|_____|
           _____|_____|_____|
          |_____|_____|     .
                       _____._____
                 _____|_____|_____|
           _____|_____|_____|
     _____|_____|_____|     .
    |_____|_____|           .
    =================================================
     __7__ __6__ __5__ __4__ __3__ __2__ __1__ __0__ 
    |_q3__|_q2__|_q1__|_q0__|_ign_|_ign_|_ign_|_ign_|

Diagram explanations:
  u3, u2, u3 are limbs from the dividend.
  rem is the remainder from the previous iteration.
  q3, q2, q1, q0 are preliminary quotient limbs.

Since we only care abuot the upper product half, here 4 limbs,
and since the code is already prepared to deal with preliminary
quotients that are slightly off, we don't need to compute the
products marked with XXXXX or /////.  (The difference is that
XXXXX products will not need to be computed on CISC-style
computers while RISC-style computed will also not need to compute
the ones marked with /////.)  Products in columns 3 probably
should be computed, though, since their sum will carry directly
into the preliminary quotient.

By some luck, two of the three XXXXX|XXXXX products and one /////
product depend on the recurrency rem.  We therefore have quite
little work to do after the recurrency rem becomes available (one
full multiply and one multiply giving the upper limb).

The preliminary remainder is then computed by means of 4 full
limb multiplies (or 7 half multiplies on RISCs)

  Some code below I put together along those lines.  Basically the same,
  but using more macros.  I think it's correct, but I don't think I got
  around to seeing which chips it suited.  ev6 might have been the
  inspiration, itanic might be helped.

Almost any chip with a (fully or partially) pipelined integer
multiplier should be helped.

-- 
Torbjörn