Alternative div_qr_1

Sun May 9 01:18:56 CEST 2010

Torbjorn Granlund <tg at gmplib.org> writes:

>   (E.g, store q0, q1 for even iterations in the array at qp, and use a
>   temporary storage area for the q0, q1 from odd iterations. And then q2
>   in each iteration can be added into the (q1, q0) two iterations earlier,
>   and there can be no carry except in the case that d divides B^2-1, or
>   equivalently, B^2 = 1 (mod d), which one could handle specially.
>   
> I am not sure I followed that.

Each iteration produces a quotient that is two limbs and a bit, denote
them q2, q1, q0. Use ' do denote the results of different iterations.
Write down the stuff to add up for three iterations,

  q2   q1   q0
       q2'  q1'  q0'
            q2'' q1'' q0''

Now, except for the case that d is a factor of B^2 - 1 (so that B^2 = 1
(mod d)), (q2, q1, q0) can be neither B^2-1 nor 2 B^2-1, and therefore
we can add <q1, q0> + q2'' without overflowing two limbs. So we can do
that addition without even an unlikely carry rippling further. If we add
in all those q2'':s to the (q1, q0) two iterations earlier, it remains to
add up

  q1  q0  q1'' q0''  q1'''' q0'''' ...
      q1' q0'  q1''' q0'''         ...

(Maybe not of much help, unless we're willing to postpone that bignum
addition until the end).

> I coded the thing in assembly, with some initial help from gcc.
>
> cpu     old new
> corei   19  17
> athlon  13  12
>
> This is not a great improvement, but I didn't try very hard.  I am
> confident that it is possible we save a few cycles.  The critical path
> is 10 cycles on athlon.

Cool!

> We have mod_1_1p, which runs at 12.4 and 7 cycles, respectively, on the
> above processors, and that code also handles any d.

That will be tough to beat (when we stay with _1-type methods). If one
is really clever, maybe one can get the u1 recurrency down to mul + adc
(6 cycles on amd). One then has to handle carry out from that addition
using adjustment steps running in parallel with the mul of the next
iteration. Here's one (untested) variant:

mp_limb_t
mpn_mod_1 (mp_srcptr ap, mp_size_t n, mp_limb_t b)
{
  mp_limb_t binv;
  mp_limb_t B2;
  mp_limb_t a2, a1, a0;
  mp_limb_t r0, r1;
  mp_limb_t q;

  ASSERT (b & GMP_LIMB_HIGHBIT);

  invert_limb (binv, b);

  B2 = - binv * b; /* B^2 mod b */

  a2 = 0; a1 = ap[n-1]; a0 = ap[n-2];

  for (; n > 2; n--)
    {
      umul_ppmm (r1, r0, a1, B2);
      ADDC_LIMB (a2, a0, a0, (-a2) & B2);
      add_ssaaaa (a1, a0, a0, ap[n-3], r1, r0);

      /* Clumsy way to detect carry */
      a2 += (a1 < r1 || (a1 == r1 && a0 < r0));
    }

  add_ssaaaa (a1, a0, a1, a0, 0, (-a2) & B2);
  if (a1 >= d)
    a1 -= d;

  udiv_qrnnd_preinv (q, a0, a1, a0, b, binv);

  return a0;
}

The inner loop should be something like (on x86, with a1 in %eax):

loop:
	mul	B2		C 0  6 12  (AMD cycles)
        add	a0, t		C 0  8 ...
        sbb	a2, a2		C 1  9
        mov	(%esi, j), a0
        add	%eax, a0	C 4 10
        mov	t, %eax		C 1  9
        adc	%edx, %eax	C 5 11
        xor	t, t
        sbb	$0, a2		C 6 12
        cmovne	B2, t		C 7 13
        dec	j
        bne	loop

There's some juggling of values to get the right stuff in %eax at the
right time, but if the register renaming machinery is friendly (the
critical thing is that the mov t, %eax must run in the same cycle (or
earlier) as the add %eax, a0), it should be possible to run in 6 cycles.
And with 5 free decoder slots, if we want to add some instructions to
also keep track of the quotient.

And now it's time to go to bed ;-)

/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.