Alternative div_qr_1

Niels Möller nisse at
Wed Jun 16 21:48:49 CEST 2010

Torbjorn Granlund <tg at> writes:

> About the inner loop, couldn't 'sub r2,r2' + 'and B2modb,r2' be replaced
> by 'mov $0,r2' + 'cmovc B2modb,r2' for a slightly more shallow path (but
> perhaps not the critical path)?

That would work, but it makes the processing after the loop more
complicated, since it needs to do a different conditional operation
depending on r2. I don't think it's critical, since the latency for the r1
recurrency is 6 cycles (mul, add %rax, adc %rdx)

> It seems r2 and t0 could share register, if 'mov t0, r0' were moved
> somewhat earlier. 

Nice catch. 

>  But perhaps that messes up the pipeline?

I had to add a nop before the loop mixer found a new 6 cycle sequence,
but that should work, and need one register less.

	lea    (B2md, r0), t1
	and    B2, t0
	mul    B2
	add    t0, r0
	mov    (up, un, 8), t0
	cmovc  t1, r0
	add    %rax, t0
	mov    r0, %rax
	adc    %rdx, %rax
	mov    t0, r0
	sbb    t0, t0
	sub    $1, un
	jnc    loop_begin

(not actually tested, though).

I also wonder if the current way of handling an unnormalized divisor is
the best. The loop reduces things modulo d, ending up with {r2, r1, r0},
two words plus a bit, which is equal to the input mod d.

The final processing reduces this to {r1, r0} < d B using the
precomputed B1 = B mod d. This is the only use of B1 in the function,
and using B1 is chepar (one multiplication less and less other logic)
than dividing using dinv). And then {r1, r0} and d are both left shifted
before multiplying r1 * dinv.

Alternatively, one could try to normalize the input u on the fly. Let's
assume this can be done and still run at 6 cycles in the inner loop
(it's not on the critical path, and we have four unused decoder slots).
Then we eliminate the multiplication by B1 in the final processing, but
we also make the input one limb longer, so the total number of
multiplications is unchanged. But it has the advantage that B1 need not
be computed at all.

I also have a question and a comment on the cps functions in general:

1. What's "cps" supposed to mean?

2. I think it would be a better interface to also include d (or any
   variant of it needed) in the pre array, and *not* pass the
   (normalized) divisor as an argument to the main mod_1_k functions.
   That would give the cps function the freedom to choose if d or
   normalized d or negated d or some other variant is most useful.

   For my current mod_1_1p function, and the current cps function, it
   works like this: The cps functions computes normalized divisor d <<
   count, in order to invert it. Next, the caller, who doesn't have
   access to this value, recomputes d << pre[1] and passes this as an
   argument to mod_1_1p. Finally, mod_1_1p has no use for the normalized
   d until it gets to the final processing, so it right shifts the passed in
   normalized d, undoing the caller's shift. This is clearly not the
   best way to do it.

Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

More information about the gmp-devel mailing list