mod_2
Torbjorn Granlund
tg at gmplib.org
Thu May 6 16:26:49 CEST 2010
nisse at lysator.liu.se (Niels Möller) writes:
shell.gmplib.org.
My code was optimized for K8-K10.
> The x86_64/divrem_2.asm code uses the udiv_qr_3by2 algorithm, unless I
> am much mistaken. Does C code beat the assembly code for the same
> algorithm?
>
> Also x86/divrem_2.asm use the udiv_qr_3by2 algorithm.
Then my results are more unexpected. But there should be some possible
savings for not storing or adjusting the quotient.
Perhaps, but I think the saving is mainly that you run my code +
MPN_COPY and your code directly.
> I think you would get better (small operands) figures if you called
> divrem_2 directly, instead of going through mpn_tdiv_qr. IIRC, the
> semantics of mpn_tdiv_qr will require a copy to be made.
But forcing it to copy is fair, if that's needed for not clobering the
input... (even though I guess the assembler implementation clobbers the
input only to store the remainder in the first two limbs).
Perhaps it is "fair" by some definition, but if we are looking for a
fast division-by-2limb loop, then it is more useful to compare the loops
shoulder-to-shoulder.
I've now also tried calling divrem_2 directly (but with copying things
around to get the interface I want):
void
mod_2_ref (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_ptr dp, mp_ptr scratch)
{
ASSERT (un >= 2);
MPN_COPY (scratch, up, un);
mpn_divrem_2 (scratch + un, 0, scratch, un, dp);
MPN_COPY (rp, scratch, 2);
}
And my C loop seems to still beat divrem_2 for n >= 3 (but with less
margin, around 5--8%).
That's interesting. How many cycles per limb does you code take?
So the fraction function(s) should take as inputs r, d with r < d and a
fraction size fn, and compute quotient and optionally remainder for the
division r B^fn / d. And with special functions for d of size 1 and 2.
Yes.
Name for it?
There are two major unsolved problems in computer science, whether P =
NP and naming things...
> I'd like a set of plain functions with just a single simple loop. In
> such lowest-level function testing whether to compute some type of
> inverse or not should not be done, whether divisors are normalised or
> not should not be done.
So, what should the interface be?
void
mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un,
mp_srcptr dp, mp_limb_t dinv);
void
mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un,
mp_srcptr dp, mp_limb_t dinv);
Subtleties: This makes sense for normalized inputs. But for unnormalized
inputs, we should also pass the shift count. I think you have planned to
change
typedef struct {mp_limb_t inv32;} gmp_pi1_t;
to
typedef struct {mp_limb_t inv32; unsigned shift;} gmp_pi1_t;
Is that the way to go here as well? Or maybe we should include the
shifted top limbs too, like
typedef struct {
mp_limb_t inv32;
unsigned shift;
mp_limb_t d0;
mp_limb_t d1; /* Always normalized */
} gmp_pi1_t;
If so, we get just
void
mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);
void
mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);
Many good questions...
What about first listing the loops, each corresponding to a low-lever
function, then when we're pretty sure we have the full set, we can start
worrying what food the functions will need?
--
Torbjörn
More information about the gmp-devel
mailing list