Alternative div_qr_1

Tue May 11 14:38:40 CEST 2010

Torbjorn Granlund <tg at gmplib.org> writes:

> You actually wrote an x86_64 assembly mod_1_1 some time ago that you
> claimed ran in 6 c/l.

It took me some time to dig it up, but I found this in an email I wrote
Aug 8 last year.

  mul	b2		C 0, 6 
  and	b3, r2		C 1
  mov	u[j], t0	C
  mov	r0, t1		C
  add	r2, t0		C 2
  adc	$0, t1		C 3
  sbb	r2, r2		C 4
  add	%rax, t0	C 4
  adc	%rdx, t1	C 5
  sbb	$0, r2	 	C 6
  mov	t0, r0
  mov	t1, %rax

The algorithm is really simple, the iteration is

  <r2, r1, r0>  <-- <r0, u[j]> + r1 b2 + r2 b3

where b2 = B^2 (mod d), b3 = B^3 (mod d). r2 is one-bit number,
represented so that r2 b3 is an and operation, and the right hand side
is less than B^2 + B d.

Expessing the other algorithm in similar form, it was

  <r2, r1, r0>  <-- <r0, u[j]> + r1 b2 + r2 b2 (B - d) 

We see that it differs only in the constant coefficient for r2, and
clearly b2 (B - d) = b3 (mod d).

The assembler code above needs some rearrangement to get the mov t1,
%rax off the recurrency chain. This variant might work better:

  mul	b2		C 0  6 
  and	b3, r2		C 0  7
  mov	u[j], t0	C
  add	r2, t0		C 1  8
  adc	$0, t1		C 2  9
  sbb	r2, r2		C 3 10 *
  add	%rax, t0	C 4 10 *
  mov	t1, %rax	C 4 10 *
  adc	%rdx, %rax	C 5 11
  sbb	$0, r2	 	C 6 12
  mov	t0, t1		C 7

So this should run in 6 cycles if we're lucky (and like the other code,
there are three independent instructions scheduled in the same cycle,
marked with * above. It would be fine to postpone the first of them, but
the out-of-order machinery won't know that, I'm afraid).

If we let B^2 = m d + b2 (m one limb and a bit) and B^3 = m2 d + b3
(m2 two limbs and a bit) the corresponding q piece is

  r2 m2 + r1 m

which is one multiply, one conditional add of the two-limb number m3,
and handling of the most significant implicit bits.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.