Niels Möller
nisse at lysator.liu.se
Thu Jun 17 12:31:55 CEST 2010
Torbjorn Granlund <tg at gmplib.org> writes:
> I suppose we should replace the generic/mod_1_1.c?
Maybe. I haven't tried to seriously benchmark or optimize the C
implementation. This is what it looks like (for the normalized case), if
you'd like to try it out.
mp_limb_t
div_r_1_preinv (mp_srcptr up, mp_size_t un,
		mp_limb_t d, mp_srcptr pre)
{
  mp_limb_t dinv;
  mp_limb_t B2;
  mp_limb_t r0, r1, r2;
  mp_limb_t p0, p1;
  mp_size_t j;

  ASSERT (d & GMP_LIMB_HIGHBIT);
  ASSERT (un >= 3);

  dinv = pre[0];
  B2 = pre[1];

  umul_ppmm (p1, p0, up[un-1], B2);
  r2 = 0;
  add_sssaaaa (r2, r1, r0, up[un-2], up[un-3], p1, p0);

  for (j = un-4; j >= 0; j--)
    {
      mp_limb_t mask;
      mp_limb_t cy;

      mask = -r2;
      umul_ppmm (p1, p0, r1, B2);
      ADDC_LIMB (cy, r0, r0, mask & B2);
      r0 -= (-cy) & d;
      r2 = 0;
      add_sssaaaa (r2, r1, r0, r0, up[j], p1, p0);
    }

  if (r2 > 0)
    r1 -= d;
  if (r1 >= d)
    r1 -= d;

  mod_rnnd_preinv (r0, r1, r0, d, dinv);
  return r0;
}
Here, add_sssaaaa is the macro you mailed the other day, and
mod_rnnd_preinv is udiv_qrnnd_preinv with the q argument and its updates
omitted. One could use a variant of add_sssaaaa which generates the carry
into a mask directly, using sbb r2, r2 or whatever is appropriate for
the CPU.
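
For anyone who wants to try the loop without the GMP internals, here is a
rough portable sketch of the same recurrence. It assumes 64-bit limbs and
the unsigned __int128 extension of GCC and Clang. The names add3,
mod_1_sketch and mod_1_naive are made up for illustration, add3 is only my
reading of what add_sssaaaa computes (with the top limb cleared), and the
final step uses a plain 128-bit % instead of mod_rnnd_preinv, so it says
nothing about speed.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* Hypothetical stand-in for add_sssaaaa with s2 cleared first:
   (s2,s1,s0) = (a1,a0) + (b1,b0), where s2 receives the carry out.  */
static void
add3 (uint64_t *s2, uint64_t *s1, uint64_t *s0,
      uint64_t a1, uint64_t a0, uint64_t b1, uint64_t b0)
{
  u128 lo = (u128) a0 + b0;
  u128 hi = (u128) a1 + b1 + (uint64_t) (lo >> 64);
  *s0 = (uint64_t) lo;
  *s1 = (uint64_t) hi;
  *s2 = (uint64_t) (hi >> 64);
}

/* {up,un} mod d.  Requires un >= 3 and the high bit of d set.  */
uint64_t
mod_1_sketch (const uint64_t *up, size_t un, uint64_t d)
{
  /* B2 = B^2 mod d, recomputed here rather than passed in via pre[].  */
  uint64_t B1 = (uint64_t) (((u128) 1 << 64) % d);
  uint64_t B2 = (uint64_t) (((u128) B1 * B1) % d);
  uint64_t r0, r1, r2;
  u128 p = (u128) up[un - 1] * B2;
  size_t j;

  add3 (&r2, &r1, &r0, up[un - 2], up[un - 3],
        (uint64_t) (p >> 64), (uint64_t) p);

  for (j = un - 3; j-- > 0; )        /* j runs from un-4 down to 0 */
    {
      uint64_t mask = -r2;           /* all ones iff r2 == 1 */
      uint64_t t, cy;

      p = (u128) r1 * B2;            /* r1 B^2 == p (mod d) */
      t = r0 + (mask & B2);          /* fold r2 B^2 into the middle limb */
      cy = t < r0;
      t -= (-cy) & d;                /* on carry, wrap to r0 + B2 - d */
      add3 (&r2, &r1, &r0, t, up[j],
            (uint64_t) (p >> 64), (uint64_t) p);
    }

  if (r2 > 0)
    r1 -= d;
  if (r1 >= d)
    r1 -= d;

  /* Substitute for mod_rnnd_preinv: plain two-limb remainder.  */
  return (uint64_t) ((((u128) r1 << 64) | r0) % d);
}

/* Limb-by-limb reference, for comparison only.  */
uint64_t
mod_1_naive (const uint64_t *up, size_t un, uint64_t d)
{
  uint64_t r = 0;
  while (un-- > 0)
    r = (uint64_t) ((((u128) r << 64) | up[un]) % d);
  return r;
}
```

Comparing mod_1_sketch against mod_1_naive on a few operands with
normalized d should be enough to check the carry handling.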
> Have you looked into a mod_1_2 using the same ideas? Perhaps it will be
> tricky to get that to be as fast as possible without further restricting
> the divisor range?
I have a sketch which I think should run in 8 or maybe 7.5 cycles per
iteration. I.e., same speed as current mod_1s_2p (or maybe slightly
faster, 3.75 cycles per limb rather than 4). But unlike the current
code, it allows arbitrary d.
	# r0 in %rax		# cycle numbers   diff
	mov	8(up, un, 8), t1
	lea	(b3md, t1), t2
	add	r2, t1		#  0 10 17 25 32  8,7
	cmovc	t2, t1		#  1 11 18 26 33  8,7
	mul	b2		#  0  6 15 22 30  7,8
	mov	(up, un), r0
	add	%rax, r0	#  4 10 19 26 34  7,8
	adc	%rdx, t1	#  5 12 20 27 35  7,8
	mov	r1, %rax	#  0  9 16 24 31  8,7
	lea	(-d, t1), r1	#  6 13 21 28 36  7,8
	cmovnc	t1, r1		#  7 14 22 29 37  7,8
	mul	b3		#  1 10 17 25 32  8,7
	xor	r2, r2
	add	r0, %rax	#  5 14 21 29 36  8,7
	adc	%rdx, r1	#  8 15 23 30 38  7,8
	cmovc	b3, r2		#  9 16 24 31 39  7,8
Also mod_1_4 should be possible at the same speed as current code,
but with arbitrary d.
The first conditional subtraction of d in r1 + b3 -? d is fairly cheap
since b3 is invariant, but the second conditional instruction adds
several cycles of critical latency.
Is it reasonable to have a table lookup in the inner loop? Then it might
be a good idea to collect all carries into r2 (unlike the above code
which puts only the carry for the final addition into r2, to limit r2 to
the values 0 and 1), and use table lookup to find r2 B^3 (mod d) for the
small range of possible values for r2.
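
To make the table idea concrete, the precomputation could look roughly
like the sketch below. It again assumes 64-bit limbs and unsigned
__int128. The name fill_b3_table and the bound R2_MAX on the number of
collected carries are hypothetical; the real bound would depend on how
the loop is arranged.

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

enum { R2_MAX = 4 };              /* hypothetical number of r2 values */

/* tab[k] = k B^3 mod d for 0 <= k < R2_MAX, with B = 2^64 and d != 0.  */
void
fill_b3_table (uint64_t tab[R2_MAX], uint64_t d)
{
  uint64_t b1 = (uint64_t) (((u128) 1 << 64) % d);   /* B   mod d */
  uint64_t b2 = (uint64_t) (((u128) b1 * b1) % d);   /* B^2 mod d */
  uint64_t b3 = (uint64_t) (((u128) b2 * b1) % d);   /* B^3 mod d */
  int k;

  for (k = 0; k < R2_MAX; k++)
    tab[k] = (uint64_t) (((u128) k * b3) % d);
}
```

The inner loop would then load tab[r2] in place of the cmovc that selects
between 0 and b3.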
