Alternative div_qr_1
Torbjorn Granlund
tg at gmplib.org
Sat May 8 19:53:28 CEST 2010
nisse at lysator.liu.se (Niels Möller) writes:
Here's an alternative implementation of div_qr_1, with nicer recurrency.
This is very cool. I looked into making quotient-full division out of
the new mod_1_N trick, but only considered N > 1, and there the
precomputed quotients become N limbs (plus up to two bits), which looked
bad. On 2nd thought, I think it is viable to base quotient-full code on
mod_1s_2p.
It's not fast, and I think this C implementation suffers badly from
having to check for carry from add_ssaaaa. (Is there any better way to
do that?)
I use the macros below for some experimental code. For your usage, you
need to initialize the s2 parameter to 0 before you invoke the macro.
/* Define some longlong.h-style macros, but for wider operations.
 * add_sssaaaa is like longlong.h's add_ssaaaa, but propagates the
 * carry-out into an additional sum operand.
 */
#if (defined (__i386__) || defined (__i486__)) && W_TYPE_SIZE == 32
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)                         \
  __asm__ ("add\t%7, %k2\n\tadc\t%5, %k1\n\tadc\t$0, %k0"               \
           : "=r" (s2), "=&r" (s1), "=&r" (s0)                          \
           : "0" ((USItype)(s2)),                                       \
             "1" ((USItype)(a1)), "g" ((USItype)(b1)),                  \
             "%2" ((USItype)(a0)), "g" ((USItype)(b0)))
#endif
#if defined (__amd64__) && W_TYPE_SIZE == 64
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)                         \
  __asm__ ("add\t%7, %q2\n\tadc\t%5, %q1\n\tadc\t$0, %q0"               \
           : "=r" (s2), "=&r" (s1), "=&r" (s0)                          \
           : "0" ((UDItype)(s2)),                                       \
             "1" ((UDItype)(a1)), "rme" ((UDItype)(b1)),                \
             "%2" ((UDItype)(a0)), "rme" ((UDItype)(b0)))
#endif
#if HAVE_HOST_CPU_FAMILY_powerpc && W_TYPE_SIZE == 64
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)                         \
  __asm__ ("add%I7c\t%2,%6,%7\n\tadde\t%1,%4,%5\n\taddze\t%0,%0"        \
           : "=r" (s2), "=&r" (s1), "=&r" (s0)                          \
           : "0" ((UDItype)(s2)),                                       \
             "r" ((UDItype)(a1)), "r" ((UDItype)(b1)),                  \
             "%2" ((UDItype)(a0)), "rI" ((UDItype)(b0)))
#endif
#ifndef add_sssaaaa
#define add_sssaaaa(s2, s1, s0, a1, a0, b1, b0)                         \
  do {                                                                  \
    UWtype __s0, __s1, __c0, __c1;                                      \
    __s0 = (a0) + (b0);                                                 \
    __s1 = (a1) + (b1);                                                 \
    __c0 = __s0 < (a0);                /* carry out of the low add  */  \
    __c1 = __s1 < (a1);                /* carry out of the high add */  \
    (s0) = __s0;                                                        \
    __s1 = __s1 + __c0;                /* fold low carry into s1    */  \
    (s1) = __s1;                                                        \
    (s2) += __c1 + (__s1 < __c0);      /* accumulate carries in s2  */  \
  } while (0)
#endif
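As a small, hypothetical usage sketch (not part of the patch): with s2
cleared first, repeated calls accumulate the carries out of (s1, s0) into
s2, building up a three-limb sum.  It assumes the generic fallback above
is in scope; UWtype is defined here only so the fragment stands alone.

  #include <stdio.h>

  typedef unsigned long UWtype;   /* assumption: one limb = unsigned long */

  /* ... generic add_sssaaaa fallback from above assumed here ... */

  int
  main (void)
  {
    UWtype s2, s1, s0;

    s2 = 0;                       /* must start at 0, as noted above */
    s1 = 0;
    s0 = 0;

    /* (s2,s1,s0) += (1, ~0), then (s2,s1,s0) += (~0, ~0).  Letting the
       s1/s0 outputs be the same variables as the a1/a0 inputs is fine;
       those are the tied operands in the asm versions, and the C fallback
       reads its inputs before writing the outputs.  */
    add_sssaaaa (s2, s1, s0, s1, s0, (UWtype) 1, ~(UWtype) 0);
    add_sssaaaa (s2, s1, s0, s1, s0, ~(UWtype) 0, ~(UWtype) 0);

    printf ("%lx %lx %lx\n", (unsigned long) s2, (unsigned long) s1,
            (unsigned long) s0);
    return 0;
  }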
The recurrency is via u1 and u0. In x86 assembler, it would be

  mul
  add
  adc
  lea
  cmov

and some moves.
The u0 recurrency is surely dominated by the u1 recurrency.
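As a reading aid (my interpretation, not code from the patch): the mul is
a widening multiply such as longlong.h's umul_ppmm, the add/adc pair is a
double-limb addition such as add_ssaaaa, and the lea/cmov pair is a
branch-free conditional adjustment.  With placeholder mp_limb_t variables,
that mix typically looks like:

  umul_ppmm (p1, p0, u1, c);           /* mul:  widening multiply          */
  add_ssaaaa (r1, r0, p1, p0, 0, u0);  /* add + adc: double-limb addition  */
  t = q + 1;                           /* lea:  compute a candidate value  */
  q = (r1 >= d) ? t : q;               /* cmov: branch-free select         */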
The code above adds all quotient limbs together on the fly. I think that
should be reasonable, if the MPN_INCR_U results in an unlikely jump to a
carry propagation loop. But there are surely other ways to organize it.
MPN_INCR_U is written to work like that.
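A hypothetical sketch of that on-the-fly accumulation (placeholder names
and indexing, not taken from the patch), where q1:q0 is the two-limb
quotient contribution of the current iteration and qp[i+1] was stored one
iteration earlier; it assumes gmp-impl.h's MPN_INCR_U (ptr, size, incr):

  qp[i] = q0;                               /* store the new low limb      */
  MPN_INCR_U (qp + i + 1, qn - i - 1, q1);  /* add q1 into the stored
                                               limbs; the carry propagation
                                               loop inside is the unlikely
                                               case                        */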
(E.g., store q0, q1 for even iterations in the array at qp, and use a
temporary storage area for the q0, q1 from odd iterations. Then the q2
of each iteration can be added into the (q1, q0) stored two iterations
earlier, and there can be no carry except in the case that d divides
B^2 - 1, or equivalently, B^2 = 1 (mod d), which one could handle
specially.)
I am not sure I followed that.
One idea, perhaps it is what you suggest above, is to keep a little qp[]
window in registers, to make MPN_INCR_U run faster.
We now get something like

  mov   %rsi, -16(%rbp)
  add   %rcx, -8(%rbp)
  jbe   very_unlikely

which is two cache loads and one store. We could improve that to

  add   %rcx, %r15
  mov   %r15, -8(%rbp)
  mov   %rsi, %r15
  jbe   very_unlikely
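As a rough C rendering of that register-window idea (illustration only;
the names qh, q1, q0, qp, i are mine, and the GMP types mp_limb_t and
mp_size_t are assumed): the limb produced last iteration stays in a local
variable, so the add is register-to-register and only one store per
iteration touches qp[].

  /* qh plays the role of %r15: the quotient limb produced in the
     previous iteration, not yet written back.  */
  mp_limb_t sum = qh + q1;              /* add  %rcx, %r15       */
  int cy = sum < q1;                    /* carry out of the add  */
  qp[i + 1] = sum;                      /* mov  %r15, -8(%rbp)   */
  qh = q0;                              /* mov  %rsi, %r15       */
  if (cy)                               /* jbe  very_unlikely    */
    {
      /* Very unlikely: propagate the carry into the higher quotient
         limbs already stored.  The loop terminates since the
         accumulated quotient fits in the qp area.  */
      mp_size_t k = i + 2;
      while (++qp[k] == 0)
        k++;
    }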
On my x86_32 laptop it runs at 28 cycles per limb compared to 24 for
divrem_1. And on shell, 28 vs 19.
I coded the thing in assembly, with some initial help from gcc.
  cpu      old   new   (cycles/limb)
  corei     19    17
  athlon    13    12
This is not a great improvement, but I didn't try very hard. I am
confident that it is possible to save a few cycles. The critical path
is 10 cycles on athlon.
I have a div_qr_1_pi2 which runs at 10.3 cycles on athlon. It needs a
more expensive precomputation than the new algorithm.
BTW, as far as I see there's no mod_1 variant implementing the above
method?
The nice thing is that it allows arbitrary d, [snip]
We have mod_1_1p, which runs at 12.4 and 7 cycles, respectively, on the
above processors, and that code also handles any d. (Please see
gmplib.org/devel/asm.html for up-to-date cycle numbers. That page is
continually updated.)
--
Torbjörn