tg at gmplib.org
Thu May 6 14:17:12 CEST 2010
nisse at lysator.liu.se (Niels Möller) writes:
And here's a simple test program for mod_2 using a loop around
udiv_qr_3by2. Beats current code (mpn_tdiv_qr, calling divrem_2) on both
x86 (3% for large sizes, 15% sizes around 20) and x86_64 (10% for most
sizes), even though divrem_2 seems to be implemented in assembler there.
On which machine are you testing the x86_64 code?
The x86_64/divrem_2.asm code uses the udiv_qr_3by2 algorithm, unless I
am much mistaken. Does C code beat the assembly code for the same
Also x86/divrem_2.asm use the udiv_qr_3by2 algorithm.
(All other assembly files seem to be using the udiv_qr_3by2 algorithm,
except the Itanic code. According to mpn/ia64/divrem_2.asm I have new
code for Itanic, as well as slightly improved code for x86_64. See NEW
in the repo x86_64/divrem_2.asm.)
I think you would get better (small operands) figures if you called
divrem_2 directly, instead of going through mpn_tdiv_qr. IIRC, the
semantics of mpn_tdiv_qr will require a copy to be made. (Perhaps one
could avoid this copying when the dividend and remainder arguments are
the same, but that's a different question.)
It seems that mpn/generic/divrem_2.c still uses the old udiv_qr_2by1
style. This is an ommission.
Before rewriting divrem_2, I guess we should consider the interface. To
me, it would make sense to have two separate functions, mpn_div_qr_2,
mpn_div_r_2 (mpn_div_q_2 is less important, since the remainder is small
and almost free).
Agreed. This is true for all small-divisor division functions.
What are the current users of the fraction feature?
Perhaps MPFR. Our function mpn/generic/get_str.c passes 1 for the
Would it be ok to put the loop to generate fractional words in a
That's a good idea, I think.
I'd like a set of plain functions with just a single simple loop. In
such lowest-level function testing whether to compute some type of
inverse or not should not be done, whether divisors are normalised or
not should not be done.
I also think it's highly desirable to not clobber the large input, and
instead extract and update the high limbs (possibly normalizing them)
on the fly.
Given the speedup, I think it would be a good idea to implement the
basic udiv_qr_3by2 loop in C (it's most likely a good method for small
sizes, and on architecture like sparc where umul is a lot slower than
umullo), before figuring out if more sophisticated mod_2 variants using
Montgomery's trick should also be used and before writing the assembly
More information about the gmp-devel