Alternative div_qr_1

Thu Jun 17 12:31:55 CEST 2010

Torbjorn Granlund <tg at gmplib.org> writes:

> I suppose we should replace the generic/mod_1_1.c?

Maybe. I haven't tried to seriously benchmark or optimize the C
implementation. This is what it looks like (for the normalized case), if
you'd like to try it out.

  mp_limb_t
  div_r_1_preinv (mp_srcptr up, mp_size_t un,
  		mp_limb_t d, mp_srcptr pre)
  {
    mp_limb_t dinv;
    mp_limb_t B2;
    mp_limb_t r0, r1, r2;
    mp_limb_t p0, p1;
    mp_size_t j;

    ASSERT (d & GMP_LIMB_HIGHBIT);
    ASSERT (un >= 3);

    dinv = pre[0];
    B2 = pre[1];

    umul_ppmm (p1, p0, up[un-1], B2);
    r2 = 0;
    add_sssaaaa (r2, r1, r0, up[un-2], up[un-3], p1, p0);

    for (j = un-4; j >= 0; j--)
      {
        mp_limb_t mask;
        mp_limb_t cy;
        mask = -r2;

        umul_ppmm (p1, p0, r1, B2);
        ADDC_LIMB (cy, r0, r0, mask & B2);
        r0 -= (-cy) & d;
        r2 = 0;
        add_sssaaaa (r2, r1, r0, r0, up[j], p1, p0);
      }

    if (r2 > 0)
      r1 -= d;

    if (r1 >= d)
      r1 -= d;

    mod_rnnd_preinv (r0, r1, r0, d, dinv);
    return r0;
  }

Here, add_sssaaaa is the macro you mailed the other day, and
mod_rnnd_preinv is udiv_qrnnd_preinv with q argument and updates
omitted. One could use a variant of add_sssaaa which generates the carry
into a mask directly, using sbb r2, r2 or whatever is appropriate for
the cpu.

> Have you looked into a mod_1_2 using the same ideas?  Perhaps it will be
> tricky to get that to be as fast as possible without further restricting
> the divisor range?

I have a sketch which I think should run in 8 or maybe 7.5 cycles per
iteration. I.e., same speed as current mod_1s_2p (or maybe slightly
faster, 3.75 cycles per limb rather than 4). But unlike the current
code, it allows arbitrary d.

	# r0 in %rax            # cycle numbers   diff
	mov  8(up, un, 8), t1 
	lea  (b3md. t1), t2
	add  r2, t1		# 0 10 17 25 32   8,7
	cmovc	 t2, t1	   	# 1 11 18 26 33   8,7
	mul	 b2  	   	# 0  6 15 22 30   7,8
	mov	 (up, un), r0
	add	 %rax, r0  	# 4 10 19 26 34   7,8
	adc	 %rdx, t1  	# 5 12 20 27 35   7,8
	mov	 r1, %rax  	# 0  9 16 24 31   8,7
	lea	 (-d, t1), r1	# 6 13 21 28 36   7,8
	cmovnc	 t1, r1	   	# 7 14 22 29 37   7,8
	mul	 b3  		# 1 10 17 25 32   8,7
	xor	 r2, r2
	add	 r0, %rax	# 5 14 21 29 36   8,7
	adc	 %rdx, r1	# 8 15 23 30 38   7,8
	cmovc	 b3, r2		# 9 16 24 31 39   7,8

Also mod_1_4 should be possible at the same speed as current code,
but with arbitrary d.

The first conditional subtraction of d in r1 + b3 -? d is fairly cheap
since b3 is invariant, but the second conditional instruction adds
several cycles of critical latency.

Is it reasonable to have a table lookup in the inner loop? Then it might
be a good idea to collect all carries into r2 (unlike the above code
which puts only the carry for the final addition into r2, to limit r2 to
the values 0 and 1), and use table lookup to find r2 B^3 (mod d) for the
small range of possible values for r2.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.