div_qr_1 interface

Thu Oct 17 19:26:23 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  nisse at lysator.liu.se (Niels Möller) writes:

  > To get going, I've written C implementations of mpn_div_qr_1n_pi1 and
  > mpn_divf_qr1n_pi1, and made divrem_1 call them.

  Below, also an mpn_div_qr_1, using these primitives (and with some
  inspiration from divrem_1). For return value, I use the type

    typedef struct { mp_limb_t r; mp_limb_t qh; } gmp_qr_1_t;

  The order here is a micro-optimization. I expect that on most ABI:s, for
  the typical code fragment

    gmp_qr_1_t res;
    ...
    res.r = mpn_foo (...);
    return res;

  the return value from mpn_foo will already be in the right register, and
  only qh needs to be moved from some callee-save register to the second
  return value register.

I am not too enthusiastic about struct return types for critical
functions.  I expect this to be returned via a stack slot everywhere o
almost everywhere.

But I agree that it is nice to avoid putting that extra quotient limb at
the main quotient operand.  Now, performance is more important than
cuteness...

  I also found some old x86_64 assembly code for the new div_qr_1
  algorithm. With perfect scheduling, it could run at 10 cycles on opteron
  (the main cpu of interest at the time). Main loop is 28 instructions,
  two independent mul, and decoder limited).

I recall to have seen some code for that.  How fast does it run
currently on the various CPUs?

Code comment:

I think we cannot afford to do a separate lshift of the dividend operand
when the divisor is just a few limbs.  We need to to shifting on-the-
fly, however irksome that might be.  AN mpn_div_qr_1u_pi1 is called-for.

-- 
Torbjörn