mod_2
Niels Möller
nisse at lysator.liu.se
Thu May 6 15:45:33 CEST 2010
Torbjorn Granlund <tg at gmplib.org> writes:
> On which machine are you testing the x86_64 code?
shell.gmplib.org.
> The x86_64/divrem_2.asm code uses the udiv_qr_3by2 algorithm, unless I
> am much mistaken. Does C code beat the assembly code for the same
> algorithm?
>
> Also x86/divrem_2.asm use the udiv_qr_3by2 algorithm.
Then my results are even more unexpected. But there should still be some
savings possible from not storing or adjusting the quotient. I think this
is the inner loop generated by gcc:
        .p2align 4,,7
.L38:
        subq    $1, %r15
        subq    $8, %rdi
        cmpq    $2, %r15
        je      .L32
.L37:
        movq    %r12, %rax
        movq    -24(%rdi), %rcx
        mulq    %r9
        movq    %rax, %rsi
        addq    %rbx, %rsi
        adcq    %r12, %rdx
        movq    %rdx, %rax
        movq    %rbx, %r12
        imulq   %r13, %rax
        subq    %rax, %r12
        movq    %rbp, %rax
        subq    %rbp, %rcx
        sbbq    %r13, %r12
        mulq    %rdx
        subq    %rax, %rcx
        sbbq    %rdx, %r12
        xorl    %eax, %eax
        cmpq    %rsi, %r12
        movq    %rcx, %rbx
        setae   %al
        negq    %rax
        movq    %rax, %rdx
        andq    %rbp, %rax
        andq    %r13, %rdx
        addq    %rax, %rbx
        adcq    %rdx, %r12
        cmpq    %r12, %r13
        ja      .L38
        cmpq    %r13, %r12
        ja      .L41
        cmpq    %rbp, %rbx
        jb      .L38
.L41:
        subq    %rbp, %rbx
        sbbq    %r13, %r12
        .p2align 4,,5
        jmp     .L38
The label .L37 is the entry point to the loop, and .L32 is the function
epilogue. The cmpq %r12, %r13 followed by cmpq %r13, %r12 is a little
puzzling; there ought to be a single compare followed by one ja and one
jne or so. It's 33 instructions (if we believe the first branch at the
end is a likely one). divrem_2 seems to be a few instructions longer,
and has a branch in the middle which I guess is related to fraction
generation.
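
For reference, the loop is roughly of the following shape (a sketch rather
than the exact code; the initial reduction of the two top limbs is omitted,
d1 is assumed normalized, and di is the 3/2 inverse as stored in gmp_pi1_t):

/* Needs gmp-impl.h and longlong.h, for udiv_qr_3by2 and friends.  */
static void
mod_2_sketch (mp_ptr rp, mp_srcptr up, mp_size_t un,
              mp_limb_t d1, mp_limb_t d0, mp_limb_t di)
{
  mp_limb_t r1, r0, q;
  mp_size_t i;

  ASSERT (un >= 2);

  /* Two-limb remainder, assumed already < (d1,d0).  */
  r1 = up[un - 1];
  r0 = up[un - 2];

  /* One 3/2 division step per limb; q is computed but never stored.  */
  for (i = un - 3; i >= 0; i--)
    udiv_qr_3by2 (q, r1, r0, r1, r0, up[i], d1, d0, di);

  rp[1] = r1;
  rp[0] = r0;
}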
> I think you would get better (small operands) figures if you called
> divrem_2 directly, instead of going through mpn_tdiv_qr. IIRC, the
> semantics of mpn_tdiv_qr will require a copy to be made.
But forcing it to copy is fair, if that's needed for not clobbering the
input... (even though I guess the assembler implementation clobbers the
input only to store the remainder in the first two limbs).
I've now also tried calling divrem_2 directly (but with copying things
around to get the interface I want):
void
mod_2_ref (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_ptr dp, mp_ptr scratch)
{
  ASSERT (un >= 2);
  MPN_COPY (scratch, up, un);
  mpn_divrem_2 (scratch + un, 0, scratch, un, dp);
  MPN_COPY (rp, scratch, 2);
}
And my C loop still seems to beat divrem_2 for n >= 3 (but by a smaller
margin, around 5--8%).
>> Would it be ok to put the loop to generate fractional words in a
>> separate function?
>
> That's a good idea, I think.
So the fraction function(s) should take as inputs r, d with r < d and a
fraction size fn, and compute quotient and optionally remainder for the
division r B^fn / d. And with special functions for d of size 1 and 2.
Name for it?
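
For the size-1 case, the loop would be something like this (a sketch only;
the function name is just a placeholder, which is exactly the question):

/* Needs gmp-impl.h, for udiv_qrnnd_preinv and CNST_LIMB.  Requires d
   normalized, di from invert_limb, and r < d on entry.  */
static mp_limb_t
div_qr_1_frac_sketch (mp_ptr qp, mp_limb_t r,
                      mp_limb_t d, mp_limb_t di, mp_size_t fn)
{
  mp_size_t i;

  /* Generate the fn quotient limbs of r B^fn / d, most significant
     first, by dividing with zero low limbs appended.  */
  for (i = fn - 1; i >= 0; i--)
    udiv_qrnnd_preinv (qp[i], r, r, CNST_LIMB (0), d, di);

  return r;   /* final remainder */
}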
> I'd like a set of plain functions with just a single simple loop. In
> such lowest-level function testing whether to compute some type of
> inverse or not should not be done, whether divisors are normalised or
> not should not be done.
So, what should the interface be?
void
mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un,
              mp_srcptr dp, mp_limb_t dinv);

void
mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un,
             mp_srcptr dp, mp_limb_t dinv);
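
A caller with an already normalized two-limb divisor {dp, 2} would then do
something like this (a sketch; the wrapper name is made up):

static void
mod_2_via_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_srcptr dp)
{
  gmp_pi1_t inv;

  /* inv32 is the 3/2 inverse of the (normalized) divisor.  */
  invert_pi1 (inv, dp[1], dp[0]);
  mpn_div_r_2 (rp, up, un, dp, inv.inv32);
}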
Subtleties: This makes sense for normalized inputs. But for unnormalized
inputs, we should also pass the shift count. I think you have planned to
change
typedef struct {mp_limb_t inv32;} gmp_pi1_t;
to
typedef struct {mp_limb_t inv32; unsigned shift;} gmp_pi1_t;
Is that the way to go here as well? Or maybe we should include the
shifted top limbs too, like
typedef struct {
  mp_limb_t inv32;
  unsigned shift;
  mp_limb_t d0;
  mp_limb_t d1;   /* Always normalized */
} gmp_pi1_t;
If so, we get just
void
mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);
void
mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);
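
Initialization of such a struct could then be a single helper along these
lines (a sketch; the helper name is made up, and it assumes the four-field
layout above and dp[1] != 0):

static void
init_pi1_2_sketch (gmp_pi1_t *dinv, mp_srcptr dp)
{
  mp_limb_t d1 = dp[1], d0 = dp[0];
  int shift;

  /* Normalize the divisor and remember the shift count.  */
  count_leading_zeros (shift, d1);
  if (shift > 0)
    {
      d1 = (d1 << shift) | (d0 >> (GMP_LIMB_BITS - shift));
      d0 <<= shift;
    }

  dinv->shift = shift;
  dinv->d1 = d1;
  dinv->d0 = d0;
  invert_pi1 (*dinv, d1, d0);
}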
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.