Torbjorn Granlund tg at
Thu May 6 17:56:20 CEST 2010

nisse at (Niels Möller) writes:

  Torbjorn Granlund <tg at> writes:
  > Perhaps, but I think the saving is mainly that you run my code +
  > MPN_COPY and your code directly.
  Sure, that adds an O(n) cost. I don't understand the loops fully, but I
  think the one gcc generates for my code is also a few instructions
  shorter. Possible reasons for savings: No fraction logic, and no need to
  maintain, adjust and store q after the iteration's final umul.
my divrem_2:                       36 ('NEW' disabled)
my divrem_2:                       35 ('NEW' enabled)
your mod_2, gcc-?.?                32 (assuming first branch is taken)
Just-checked-in divrem_2, gcc-4.4: 35

The just-checked-in divrem_2.c separates the integer and fraction loops,
which saves some insns.  There is still code for updating and storing
the quotient.
  I'm also trying to argue that div_r_2 can be faster than div_qr_2 (and
  we haven't started looking at Montgomery variants).
I agree that it can be faster on some platforms.  But usually I'd expect
the speed limiter will be the critical path instructions.

Speed aside, it is useful to not need to allocate space for and pass a
quotient argument.

But there is also an argument against it: We're making a heaver assembly
writing burned on ourselves.

(I haven't read you "Montgomery quotient" message yet.)

  I measure it to 32 cycles per limb (at 1000 limbs) vs 34.5 for divrem_2.
According to tune/speed divrem_2 takes 33 in shell (or with NEW defined,
it takes 32).  The code runs at 21 and 19, respectively cycles on AMD

  > What about first listing the loops, each corresponding to a low-level
  > function, then when we're pretty sure we have the full set, we can start
  > worrying what food the functions will need?
  div_r  (motivated mainly for saving storage, in particular when the
          dividend is much larger than the divisor)
  frac   (do we need separate frac_qr, frac_q, frac_r?)
I started something at  The frac function are
missing.  Should we for the benefit of old and embedded processors
provide div_*_1 loops that actually use a 2/1 division instruction?

  gcd_1 (two single limb operands, unlike current mpn_gcd_1. For
         gcd(bignum, limb), I'd prefer to have the logic to call mod_1
         followed by gcd_1 in C, to simplify assembler implementations.
Yes, the current mpn_gcd_1 is really silly, calling mod_1.  I suppose we
should add a mpn_gcd_11, taking two single-limb operands.

  gcd_2 (currently defined statically in mpn/gcd.c)
Accepting two 2-limb operands?

  gcdext_1 (exists, but I'm not sure what the interface should be)
  gcdext_2 (not written)

  jacobi_1 (current mpn_jacobi_base, but perhaps allowing b == 1)
  jacobi_2 (currently defined statically in jacobi_lehmer.c)

You're the expert here.


More information about the gmp-devel mailing list