# Alternative div_qr_1

Niels Möller nisse at lysator.liu.se
Tue May 11 14:38:40 CEST 2010

```Torbjorn Granlund <tg at gmplib.org> writes:

> You actually wrote an x86_64 assembly mod_1_1 some time ago that you
> claimed ran in 6 c/l.

It took me some time to dig it up, but I found this in an email I wrote
Aug 8 last year.

mul	b2		C 0, 6
and	b3, r2		C 1
mov	u[j], t0	C
mov	r0, t1		C
sbb	r2, r2		C 4
sbb	\$0, r2	 	C 6
mov	t0, r0
mov	t1, %rax

The algorithm is really simple, the iteration is

<r2, r1, r0>  <-- <r0, u[j]> + r1 b2 + r2 b3

where b2 = B^2 (mod d), b3 = B^3 (mod d). r2 is one-bit number,
represented so that r2 b3 is an and operation, and the right hand side
is less than B^2 + B d.

Expessing the other algorithm in similar form, it was

<r2, r1, r0>  <-- <r0, u[j]> + r1 b2 + r2 b2 (B - d)

We see that it differs only in the constant coefficient for r2, and
clearly b2 (B - d) = b3 (mod d).

The assembler code above needs some rearrangement to get the mov t1,
%rax off the recurrency chain. This variant might work better:

mul	b2		C 0  6
and	b3, r2		C 0  7
mov	u[j], t0	C
add	r2, t0		C 1  8
adc	\$0, t1		C 2  9
sbb	r2, r2		C 3 10 *
add	%rax, t0	C 4 10 *
mov	t1, %rax	C 4 10 *
adc	%rdx, %rax	C 5 11
sbb	\$0, r2	 	C 6 12
mov	t0, t1		C 7

So this should run in 6 cycles if we're lucky (and like the other code,
there are three independent instructions scheduled in the same cycle,
marked with * above. It would be fine to postpone the first of them, but
the out-of-order machinery won't know that, I'm afraid).

If we let B^2 = m d + b2 (m one limb and a bit) and B^3 = m2 d + b3
(m2 two limbs and a bit) the corresponding q piece is

r2 m2 + r1 m

which is one multiply, one conditional add of the two-limb number m3,
and handling of the most significant implicit bits.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
```