Alternative div_qr_1

Sat May 8 19:53:28 CEST 2010

nisse at lysator.liu.se (Niels Möller) writes:

  Here's an alternative implementation of div_qr_1, with nicer recurrency.

This is very cool.  I looked into making quotient-full division out of
the new mod_1_N trick, but only considered N > 1, and there the
precomputed quotients become N limbs (plus up to two bits), which looked
bad.  On 2nd thought, I think it is viable to base quotient-full code on
mod_1s_2p.

  It's not fast, and I think this C implementation suffers badly from
  having to check for carry from add_ssaaaa. (Is there any better way to
  do that?)

I use the macros below for some experimental code.  For your usage, you
need to initiate the s2 parameter to 0 before you invoke the macro.

/* Define some longlong.h-style macros, but for wider operations.
   * add_sssaaaa is like longlong.h's add_ssaaaa but the propagating
     carry-out into an additional sum opeand.
*/

#if (defined (__i386__) || defined (__i486__)) && W_TYPE_SIZE == 32
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)				\
  __asm__ ("add\t%7, %k2\n\tadc\t%5, %k1\n\tadc\t$0, %k0"		\
	   : "=r" (s2), "=&r" (s1), "=&r" (s0)				\
	   : "0"  ((USItype)(s2)),					\
	     "1"  ((USItype)(a1)), "g" ((USItype)(b1)),			\
	     "%2" ((USItype)(a0)), "g" ((USItype)(b0)))
#endif

#if defined (__amd64__) && W_TYPE_SIZE == 64
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)				\
  __asm__ ("add\t%7, %q2\n\tadc\t%5, %q1\n\tadc\t$0, %q0"		\
	   : "=r" (s2), "=&r" (s1), "=&r" (s0)				\
	   : "0"  ((UDItype)(s2)),					\
	     "1"  ((UDItype)(a1)), "rme" ((UDItype)(b1)),		\
	     "%2" ((UDItype)(a0)), "rme" ((UDItype)(b0)))
#endif

#if HAVE_HOST_CPU_FAMILY_powerpc && W_TYPE_SIZE == 64
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)				\
  __asm__ ("add%I7c\t%2,%6,%7\n\tadde\t%1,%4,%5\n\taddze\t%0,%0"	\
	   : "=r" (s2), "=&r" (s1), "=&r" (s0)				\
	   : "0"  ((UDItype)(s2)),					\
	     "r"  ((UDItype)(a1)), "r" ((UDItype)(b1)),			\
	     "%2" ((UDItype)(a0)), "rI" ((UDItype)(b0)))
#endif

#ifndef add_sssaaaa
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)				\
  do {									\
    UWtype __s0, __s1, __c0, __c1;					\
    __s0 = (a0) + (b0);							\
    __s1 = (a1) + (b1);							\
    __c0 = __s0 < (a0);							\
    __c1 = __s1 < (a1);							\
    (s0) = __s0;							\
    __s1 = __s1 + __c0;							\
    (s1) = __s1;							\
    (s2) += __c1 + (__s1 < __c0);					\
  } while (0)
#endif

  The recurrency is via u1 and u0. In x86 assembler, it would be

  	mul
          add
    	adc
          lea
          cmov

  and some moves.

The u0 recurrency is surely dominated by the u1 recurrency.

  The code above adds all quotient limbs together on the fly. I think that
  should be reasonable, if the MPN_INCR_U results in an unlikely jump to a
  carry propagation loop. But there are sure other ways to organize it.

MPN_INCR_U is written to work like that.

  (E.g, store q0, q1 for even iterations in the array at qp, and use a
  temporary storage area for the q0, q1 from odd iterations. And then q2
  in each iteration can be added into the (q1, q0) two iterations earlier,
  and there can be no carry except in the case that d divides B^2-1, or
  equivalently, B^2 = 1 (mod d), which one could handle specially.

I am not sure I followed that.

One idea, perhaps it is what you suggest above, is to keep a little qp[]
window in registers, to make MPN_INCR_U run faster.

We now get something like

        mov     %rsi, -16(%rbp)
        add     %rcx, -8(%rbp)
        jbe     very_unlikely

which is two cache loads and one store.  We could improve that to

        add     %rcx, %r15
        mov     %r15, -8(%rbp)
        mov     %rsi, %r15
        jbe     very_unlikely

  On my x86_32 laptop it runs at 28 cycles per limb compared to 24 for
  divrem_1. And on shell, 28 vs 19.

I coded the thing in assembly, with some initial help from gcc.

cpu     old new
corei   19  17
athlon  13  12

This is not a great improvement, but I didn't try very hard.  I am
confident that it is possible we save a few cycles.  The critical path
is 10 cycles on athlon.

I have a div_qr_1_pi2 which runs at 10.3 cycles on athlon.  It needs a
more expensive precomputation than the new algorithm.

  BTW, as far as I see there's no mod_1 variant implementing the above
  method?

  The nice thing is that it allows arbitrary d, [snip]

We have mod_1_1p, which runs at 12.4 and 7 cycles, respectively, on the
above processors, and that code also handles any d.  (Please see
gmplib.org/devel/asm.html for up-to-date ccyle numbers.  That page is
continually updated.)

-- 
Torbjörn