Torbjorn Granlund tg at
Thu May 6 14:17:12 CEST 2010

nisse at (Niels Möller) writes:

  And here's a simple test program for mod_2 using a loop around
  udiv_qr_3by2. Beats current code (mpn_tdiv_qr, calling divrem_2) on both
  x86 (3% for large sizes, 15% sizes around 20) and x86_64 (10% for most
  sizes), even though divrem_2 seems to be implemented in assembler there.
On which machine are you testing the x86_64 code?

The x86_64/divrem_2.asm code uses the udiv_qr_3by2 algorithm, unless I
am much mistaken.  Does C code beat the assembly code for the same

Also x86/divrem_2.asm use the udiv_qr_3by2 algorithm.

(All other assembly files seem to be using the udiv_qr_3by2 algorithm,
except the Itanic code.  According to mpn/ia64/divrem_2.asm I have new
code for Itanic, as well as slightly improved code for x86_64.  See NEW
in the repo x86_64/divrem_2.asm.)

I think you would get better (small operands) figures if you called
divrem_2 directly, instead of going through mpn_tdiv_qr.  IIRC, the
semantics of mpn_tdiv_qr will require a copy to be made.  (Perhaps one
could avoid this copying when the dividend and remainder arguments are
the same, but that's a different question.)

It seems that mpn/generic/divrem_2.c still uses the old udiv_qr_2by1
style.  This is an ommission.

  Before rewriting divrem_2, I guess we should consider the interface. To
  me, it would make sense to have two separate functions, mpn_div_qr_2,
  mpn_div_r_2 (mpn_div_q_2 is less important, since the remainder is small
  and almost free).

Agreed.  This is true for all small-divisor division functions.

  What are the current users of the fraction feature?

Perhaps MPFR.  Our function mpn/generic/get_str.c passes 1 for the
fraction argument.

  Would it be ok to put the loop to generate fractional words in a
  separate function?

That's a good idea, I think.

I'd like a set of plain functions with just a single simple loop.  In
such lowest-level function testing whether to compute some type of
inverse or not should not be done, whether divisors are normalised or
not should not be done.

  I also think it's highly desirable to not clobber the large input, and
  instead extract and update the high limbs (possibly normalizing them)
  on the fly.

  Given the speedup, I think it would be a good idea to implement the
  basic udiv_qr_3by2 loop in C (it's most likely a good method for small
  sizes, and on architecture like sparc where umul is a lot slower than
  umullo), before figuring out if more sophisticated mod_2 variants using
  Montgomery's trick should also be used and before writing the assembly


More information about the gmp-devel mailing list