Alternative div_qr_1
Niels Möller
nisse at lysator.liu.se
Sun May 9 01:18:56 CEST 2010
Torbjorn Granlund <tg at gmplib.org> writes:
>> (E.g., store q0, q1 for even iterations in the array at qp, and use a
>> temporary storage area for the q0, q1 from odd iterations. And then q2
>> in each iteration can be added into the (q1, q0) two iterations earlier,
>> and there can be no carry except in the case that d divides B^2-1, or
>> equivalently, B^2 = 1 (mod d), which one could handle specially.)
>
> I am not sure I followed that.
Each iteration produces a quotient that is two limbs and a bit; denote
them q2, q1, q0. Use ' to denote the results of different iterations.
Write down the stuff to add up for three iterations,

  q2 q1 q0
     q2' q1' q0'
        q2'' q1'' q0''
Now, except for the case that d is a factor of B^2 - 1 (so that B^2 = 1
(mod d)), (q2, q1, q0) can be neither B^2-1 nor 2 B^2-1. Since q2'' is a
single bit, the addition <q1, q0> + q2'' can overflow two limbs only
when <q1, q0> = B^2-1, i.e., precisely in the excluded case, so we can
do that addition without even an unlikely carry rippling further. If we
add in all those q2'':s to the (q1, q0) two iterations earlier, it
remains to add up
  q1 q0 q1'' q0'' q1'''' q0'''' ...
     q1' q0' q1''' q0''' ...
(Maybe not of much help, unless we're willing to postpone that bignum
addition until the end).
> I coded the thing in assembly, with some initial help from gcc.
>
> cpu      old  new
> corei     19   17
> athlon    13   12
>
> This is not a great improvement, but I didn't try very hard. I am
> confident that it is possible to save a few cycles. The critical path
> is 10 cycles on athlon.
Cool!
> We have mod_1_1p, which runs at 12.4 and 7 cycles, respectively, on the
> above processors, and that code also handles any d.
That will be tough to beat (as long as we stay with _1-type methods).
If one is really clever, maybe one can get the u1 recurrence down to
mul + adc (6 cycles on AMD). One then has to handle carry out from that
addition using adjustment steps running in parallel with the mul of the
next iteration. Here's one (untested) variant:
/* Needs GMP's internal headers, for invert_limb, umul_ppmm,
   add_ssaaaa, ADDC_LIMB, udiv_qrnnd_preinv and ASSERT.  */
#include "gmp-impl.h"
#include "longlong.h"

mp_limb_t
mpn_mod_1 (mp_srcptr ap, mp_size_t n, mp_limb_t b)
{
  mp_limb_t binv;
  mp_limb_t B2;
  mp_limb_t a2, a1, a0;
  mp_limb_t r0, r1;
  mp_limb_t q;

  ASSERT (b & GMP_LIMB_HIGHBIT);

  invert_limb (binv, b);
  B2 = - binv * b;		/* B^2 mod b */

  a2 = 0; a1 = ap[n-1]; a0 = ap[n-2];
  for (; n > 2; n--)
    {
      /* Reduce by one limb: a1 B^2 = a1 B2 (mod b), and the carry bit
	 a2 is folded in as a2 B2 at the a0 position.  */
      umul_ppmm (r1, r0, a1, B2);
      ADDC_LIMB (a2, a0, a0, (-a2) & B2);
      add_ssaaaa (a1, a0, a0, ap[n-3], r1, r0);
      /* Clumsy way to detect carry */
      a2 += (a1 < r1 || (a1 == r1 && a0 < r0));
    }
  /* Fold the last carry bit, then reduce the remaining two limbs.  */
  add_ssaaaa (a1, a0, a1, a0, 0, (-a2) & B2);
  if (a1 >= b)
    a1 -= b;
  udiv_qrnnd_preinv (q, a0, a1, a0, b, binv);
  return a0;
}
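When experimenting with variants like this, a braindead reference is
handy. Something along these lines (my notation, not a GMP function;
assumes 64-bit limbs and a compiler with unsigned __int128):

#include <stddef.h>
#include <stdint.h>

/* ap[n-1..0] mod b by plain 128-bit folding, most significant limb
   first.  Works for any nonzero b.  */
uint64_t
ref_mod_1 (const uint64_t *ap, size_t n, uint64_t b)
{
  unsigned __int128 r = 0;
  while (n-- > 0)
    r = ((r << 64) | ap[n]) % b;
  return (uint64_t) r;
}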
The inner loop should be something like (on x86, with a1 in %eax):
loop:
	mul	B2		C 0  6 12 (AMD cycles)
	add	a0, t		C 0  8 ...
	sbb	a2, a2		C 1  9
	mov	(%esi, j), a0
	add	%eax, a0	C 4 10
	mov	t, %eax		C 1  9
	adc	%edx, %eax	C 5 11
	xor	t, t
	sbb	$0, a2		C 6 12
	cmovne	B2, t		C 7 13
	dec	j
	jne	loop
There's some juggling of values to get the right stuff in %eax at the
right time, but if the register renaming machinery is friendly (the
critical thing is that the mov t, %eax must run in the same cycle (or
earlier) as the add %eax, a0), it should be possible to run in 6 cycles.
And that leaves 5 free decoder slots, in case we want to add some
instructions to also keep track of the quotient.
And now it's time to go to bed ;-)
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.