Alternative div_qr_1

Thu Jun 17 15:00:54 CEST 2010

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > I suppose we should replace the generic/mod_1_1.c?

  Maybe. I haven't tried to seriously benchmark or optimize the C
  implementation. This is what it looks like (for the normalized case), if
  you'd like to try it out.

Yes, I think we should really replace the default C mpn_mod_1_1p code
with your variant.  Mainly since it saves an umul_ppmm, which ranslates
to many cycles for lots of different processors.

I see that your C code needs n >= 3, which might make sense when one
does not have divisor invariance beyond a single dividend.  I think
requiring n >= N or perhaps n > N in mpn_mod_1s_Np does make some sense,
since it will simplify the functions and save some cycles.

  I have a sketch which I think should run in 8 or maybe 7.5 cycles per
  iteration. I.e., same speed as current mod_1s_2p (or maybe slightly
  faster, 3.75 cycles per limb rather than 4). But unlike the current
  code, it allows arbitrary d.

  	# r0 in %rax            # cycle numbers   diff
  	mov  8(up, un, 8), t1 
  	lea  (b3md. t1), t2
  	add  r2, t1		# 0 10 17 25 32   8,7
  	cmovc	 t2, t1	   	# 1 11 18 26 33   8,7
  	mul	 b2  	   	# 0  6 15 22 30   7,8
  	mov	 (up, un), r0
  	add	 %rax, r0  	# 4 10 19 26 34   7,8
  	adc	 %rdx, t1  	# 5 12 20 27 35   7,8
**  	mov	 r1, %rax  	# 0  9 16 24 31   8,7
  	lea	 (-d, t1), r1	# 6 13 21 28 36   7,8
  	cmovnc	 t1, r1	   	# 7 14 22 29 37   7,8
  	mul	 b3  		# 1 10 17 25 32   8,7
  	xor	 r2, r2
  	add	 r0, %rax	# 5 14 21 29 36   8,7
  	adc	 %rdx, r1	# 8 15 23 30 38   7,8
  	cmovc	 b3, r2		# 9 16 24 31 39   7,8

  Also mod_1_4 should be possible at the same speed as current code,
  but with arbitrary d.

  The first conditional subtraction of d in r1 + b3 -? d is fairly cheap
  since b3 is invariant, but the second conditional instruction adds
  several cycles of critical latency.

I see.  Perhaps a cycle could be saved by not copying r1 to rax (marked
with **) but instead copy b3 to rax, and replace 'mul b3' by 'mul r1'?
This will make it necessary to use two registers for r1, so that even
iteration use one register and odd iterations use another.

  Is it reasonable to have a table lookup in the inner loop? Then it might
  be a good idea to collect all carries into r2 (unlike the above code
  which puts only the carry for the final addition into r2, to limit r2 to
  the values 0 and 1), and use table lookup to find r2 B^3 (mod d) for the
  small range of possible values for r2.

Sure, as long as the table creation does not negate the saved cycles;
even mod_1_2 will only by used up to quite limited sizes.

-- 
Torbjörn