mod_2

Thu May 6 16:26:49 CEST 2010

nisse at lysator.liu.se (Niels Möller) writes:

  shell.gmplib.org.

My code was optimized for K8-K10.

  > The x86_64/divrem_2.asm code uses the udiv_qr_3by2 algorithm, unless I
  > am much mistaken.  Does C code beat the assembly code for the same
  > algorithm?
  >
  > Also x86/divrem_2.asm use the udiv_qr_3by2 algorithm.

  Then my results are more unexpected. But there should be some possible
  savings for not storing or adjusting the quotient.

Perhaps, but I think the saving is mainly that you run my code +
MPN_COPY and your code directly.

  > I think you would get better (small operands) figures if you called
  > divrem_2 directly, instead of going through mpn_tdiv_qr.  IIRC, the
  > semantics of mpn_tdiv_qr will require a copy to be made.

  But forcing it to copy is fair, if that's needed for not clobering the
  input... (even though I guess the assembler implementation clobbers the
  input only to store the remainder in the first two limbs).

Perhaps it is "fair" by some definition, but if we are looking for a
fast division-by-2limb loop, then it is more useful to compare the loops
shoulder-to-shoulder.

  I've now also tried calling divrem_2 directly (but with copying things
  around to get the interface I want):

    void
    mod_2_ref (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_ptr dp, mp_ptr scratch)
    {
      ASSERT (un >= 2);

      MPN_COPY (scratch, up, un);
      mpn_divrem_2 (scratch + un, 0, scratch, un, dp);
      MPN_COPY (rp, scratch, 2);
    }

  And my C loop seems to still beat divrem_2 for n >= 3 (but with less
  margin, around 5--8%).

That's interesting.  How many cycles per limb does you code take?

  So the fraction function(s) should take as inputs r, d with r < d and a
  fraction size fn, and compute quotient and optionally remainder for the
  division r B^fn / d. And with special functions for d of size 1 and 2.

Yes.

  Name for it?

There are two major unsolved problems in computer science, whether P =
NP and naming things...

  > I'd like a set of plain functions with just a single simple loop.  In
  > such lowest-level function testing whether to compute some type of
  > inverse or not should not be done, whether divisors are normalised or
  > not should not be done.

  So, what should the interface be?

    void
    mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un,
    	        mp_srcptr dp, mp_limb_t dinv);
    void
    mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un,
    	       mp_srcptr dp, mp_limb_t dinv);

  Subtleties: This makes sense for normalized inputs. But for unnormalized
  inputs, we should also pass the shift count. I think you have planned to
  change

    typedef struct {mp_limb_t inv32;} gmp_pi1_t;

  to

    typedef struct {mp_limb_t inv32; unsigned shift;} gmp_pi1_t;

  Is that the way to go here as well? Or maybe we should include the
  shifted top limbs too, like

    typedef struct {
      mp_limb_t inv32;
      unsigned shift;
      mp_limb_t d0;
      mp_limb_t d1; /* Always normalized */
    } gmp_pi1_t;

  If so, we get just

    void
    mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);

    void
    mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);

Many good questions...

What about first listing the loops, each corresponding to a low-lever
function, then when we're pretty sure we have the full set, we can start
worrying what food the functions will need?

-- 
Torbjörn