mod_2
Niels Möller
nisse at lysator.liu.se
Thu May 6 15:45:33 CEST 2010
Torbjorn Granlund <tg at gmplib.org> writes:
> On which machine are you testing the x86_64 code?
shell.gmplib.org.
> The x86_64/divrem_2.asm code uses the udiv_qr_3by2 algorithm, unless I
> am much mistaken. Does C code beat the assembly code for the same
> algorithm?
>
> Also x86/divrem_2.asm use the udiv_qr_3by2 algorithm.
Then my results are even more unexpected. But there should still be some
savings possible from not storing or adjusting the quotient. I think this
is the inner loop generated by gcc:
        .p2align 4,,7
.L38:
        subq    $1, %r15
        subq    $8, %rdi
        cmpq    $2, %r15
        je      .L32
.L37:
        movq    %r12, %rax
        movq    -24(%rdi), %rcx
        mulq    %r9
        movq    %rax, %rsi
        addq    %rbx, %rsi
        adcq    %r12, %rdx
        movq    %rdx, %rax
        movq    %rbx, %r12
        imulq   %r13, %rax
        subq    %rax, %r12
        movq    %rbp, %rax
        subq    %rbp, %rcx
        sbbq    %r13, %r12
        mulq    %rdx
        subq    %rax, %rcx
        sbbq    %rdx, %r12
        xorl    %eax, %eax
        cmpq    %rsi, %r12
        movq    %rcx, %rbx
        setae   %al
        negq    %rax
        movq    %rax, %rdx
        andq    %rbp, %rax
        andq    %r13, %rdx
        addq    %rax, %rbx
        adcq    %rdx, %r12
        cmpq    %r12, %r13
        ja      .L38
        cmpq    %r13, %r12
        ja      .L41
        cmpq    %rbp, %rbx
        jb      .L38
.L41:
        subq    %rbp, %rbx
        sbbq    %r13, %r12
        .p2align 4,,5
        jmp     .L38
The label .L37 is the entry point to the loop, and .L32 is the function
epilogue. The cmpq %r12, %r13 followed by cmpq %r13, %r12 is a little
puzzling; there ought to be a single compare followed by one ja and one
jne or so. It's 33 instructions (if we believe the first branch at the
end is a likely one). divrem_2 seems to be a few instructions longer,
and has a branch in the middle which I guess is related to fraction
generation.
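
For reference, the loop is roughly of the following shape (a sketch rather
than the exact code; the initial reduction of the two top limbs is omitted,
d1 is assumed normalized, and di is the 3/2 inverse as stored in gmp_pi1_t):

/* Needs gmp-impl.h and longlong.h, for udiv_qr_3by2 and friends.  */
static void
mod_2_sketch (mp_ptr rp, mp_srcptr up, mp_size_t un,
              mp_limb_t d1, mp_limb_t d0, mp_limb_t di)
{
  mp_limb_t r1, r0, q;
  mp_size_t i;

  ASSERT (un >= 2);

  /* Two-limb remainder, assumed already < (d1,d0).  */
  r1 = up[un - 1];
  r0 = up[un - 2];

  /* One 3/2 division step per limb; q is computed but never stored.  */
  for (i = un - 3; i >= 0; i--)
    udiv_qr_3by2 (q, r1, r0, r1, r0, up[i], d1, d0, di);

  rp[1] = r1;
  rp[0] = r0;
}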
> I think you would get better (small operands) figures if you called
> divrem_2 directly, instead of going through mpn_tdiv_qr. IIRC, the
> semantics of mpn_tdiv_qr will require a copy to be made.
But forcing it to copy is fair, if that's needed for not clobbering the
input... (even though I guess the assembler implementation clobbers the
input only to store the remainder in the first two limbs).
I've now also tried calling divrem_2 directly (but with copying things
around to get the interface I want):
void
mod_2_ref (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_ptr dp, mp_ptr scratch)
{
  ASSERT (un >= 2);
  MPN_COPY (scratch, up, un);
  mpn_divrem_2 (scratch + un, 0, scratch, un, dp);
  MPN_COPY (rp, scratch, 2);
}
And my C loop still seems to beat divrem_2 for n >= 3 (but by a smaller
margin, around 5--8%).
>> Would it be ok to put the loop to generate fractional words in a
>> separate function?
>
> That's a good idea, I think.
So the fraction function(s) should take as inputs r, d with r < d and a
fraction size fn, and compute quotient and optionally remainder for the
division r B^fn / d. And with special functions for d of size 1 and 2.
Name for it?
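
For the size-1 case, the loop would be something like this (a sketch only;
the function name is just a placeholder, which is exactly the question):

/* Needs gmp-impl.h, for udiv_qrnnd_preinv and CNST_LIMB.  Requires d
   normalized, di from invert_limb, and r < d on entry.  */
static mp_limb_t
div_qr_1_frac_sketch (mp_ptr qp, mp_limb_t r,
                      mp_limb_t d, mp_limb_t di, mp_size_t fn)
{
  mp_size_t i;

  /* Generate the fn quotient limbs of r B^fn / d, most significant
     first, by dividing with zero low limbs appended.  */
  for (i = fn - 1; i >= 0; i--)
    udiv_qrnnd_preinv (qp[i], r, r, CNST_LIMB (0), d, di);

  return r;   /* final remainder */
}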
> I'd like a set of plain functions with just a single simple loop. In
> such lowest-level function testing whether to compute some type of
> inverse or not should not be done, whether divisors are normalised or
> not should not be done.
So, what should the interface be?
void
mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un,
              mp_srcptr dp, mp_limb_t dinv);

void
mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un,
             mp_srcptr dp, mp_limb_t dinv);
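
A caller with an already normalized two-limb divisor {dp, 2} would then do
something like this (a sketch; the wrapper name is made up):

static void
mod_2_via_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_srcptr dp)
{
  gmp_pi1_t inv;

  /* inv32 is the 3/2 inverse of the (normalized) divisor.  */
  invert_pi1 (inv, dp[1], dp[0]);
  mpn_div_r_2 (rp, up, un, dp, inv.inv32);
}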
Subtleties: This makes sense for normalized inputs. But for unnormalized
inputs, we should also pass the shift count. I think you have planned to
change
typedef struct {mp_limb_t inv32;} gmp_pi1_t;
to
typedef struct {mp_limb_t inv32; unsigned shift;} gmp_pi1_t;
Is that the way to go here as well? Or maybe we should include the
shifted top limbs too, like
typedef struct {
  mp_limb_t inv32;
  unsigned shift;
  mp_limb_t d0;
  mp_limb_t d1;   /* Always normalized */
} gmp_pi1_t;
If so, we get just
void
mpn_div_qr_2 (mp_ptr qp, mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);
void
mpn_div_r_2 (mp_ptr rp, mp_srcptr up, mp_size_t un, gmp_pi1_t *d);
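
Initialization of such a struct could then be a single helper along these
lines (a sketch; the helper name is made up, and it assumes the four-field
layout above and dp[1] != 0):

static void
init_pi1_2_sketch (gmp_pi1_t *dinv, mp_srcptr dp)
{
  mp_limb_t d1 = dp[1], d0 = dp[0];
  int shift;

  /* Normalize the divisor and remember the shift count.  */
  count_leading_zeros (shift, d1);
  if (shift > 0)
    {
      d1 = (d1 << shift) | (d0 >> (GMP_LIMB_BITS - shift));
      d0 <<= shift;
    }

  dinv->shift = shift;
  dinv->d1 = d1;
  dinv->d0 = d0;
  invert_pi1 (*dinv, d1, d0);
}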
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.