tg at gmplib.org
Thu Jun 17 15:00:54 CEST 2010
nisse at lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund <tg at gmplib.org> writes:
> I suppose we should replace the generic/mod_1_1.c?
Maybe. I haven't tried to seriously benchmark or optimize the C
implementation. This is what it looks like (for the normalized case), if
you'd like to try it out.
Yes, I think we should really replace the default C mpn_mod_1_1p code
with your variant. Mainly since it saves an umul_ppmm, which ranslates
to many cycles for lots of different processors.
I see that your C code needs n >= 3, which might make sense when one
does not have divisor invariance beyond a single dividend. I think
requiring n >= N or perhaps n > N in mpn_mod_1s_Np does make some sense,
since it will simplify the functions and save some cycles.
I have a sketch which I think should run in 8 or maybe 7.5 cycles per
iteration. I.e., same speed as current mod_1s_2p (or maybe slightly
faster, 3.75 cycles per limb rather than 4). But unlike the current
code, it allows arbitrary d.
# r0 in %rax # cycle numbers diff
mov 8(up, un, 8), t1
lea (b3md. t1), t2
add r2, t1 # 0 10 17 25 32 8,7
cmovc t2, t1 # 1 11 18 26 33 8,7
mul b2 # 0 6 15 22 30 7,8
mov (up, un), r0
add %rax, r0 # 4 10 19 26 34 7,8
adc %rdx, t1 # 5 12 20 27 35 7,8
** mov r1, %rax # 0 9 16 24 31 8,7
lea (-d, t1), r1 # 6 13 21 28 36 7,8
cmovnc t1, r1 # 7 14 22 29 37 7,8
mul b3 # 1 10 17 25 32 8,7
xor r2, r2
add r0, %rax # 5 14 21 29 36 8,7
adc %rdx, r1 # 8 15 23 30 38 7,8
cmovc b3, r2 # 9 16 24 31 39 7,8
Also mod_1_4 should be possible at the same speed as current code,
but with arbitrary d.
The first conditional subtraction of d in r1 + b3 -? d is fairly cheap
since b3 is invariant, but the second conditional instruction adds
several cycles of critical latency.
I see. Perhaps a cycle could be saved by not copying r1 to rax (marked
with **) but instead copy b3 to rax, and replace 'mul b3' by 'mul r1'?
This will make it necessary to use two registers for r1, so that even
iteration use one register and odd iterations use another.
Is it reasonable to have a table lookup in the inner loop? Then it might
be a good idea to collect all carries into r2 (unlike the above code
which puts only the carry for the final addition into r2, to limit r2 to
the values 0 and 1), and use table lookup to find r2 B^3 (mod d) for the
small range of possible values for r2.
Sure, as long as the table creation does not negate the saved cycles;
even mod_1_2 will only by used up to quite limited sizes.
More information about the gmp-devel