div_qr_1 interface

Niels Möller nisse at lysator.liu.se
Mon Oct 21 06:37:03 CEST 2013


Torbjorn Granlund <tg at gmplib.org> writes:

> Have you analysed the register needs?  Pushing all callee-saves
> registers is quite expensive.

Per the FIXME comment, we could avoid saving them for the n == 2 case
(which I think corresponds to n == 3 for the mpn_div_qr_1 caller, so it
might help that regression), but we do need a lot of registers for the
actual loop.

> For the mul insn, it is usually better to copy the invariant/noncritical
> operand to rax, and use the critical operand explicitly in the mul insn.

Will try that. I think one could also try to delay the quotient store
by one iteration, keeping "Q1" in a register until the next iteration.
Then one gets rid of the

	adc	Q2,8(QP, UN, 8)

in the loop, using only a single store per iteration in the likely case.
May need yet another register, though.
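
To make the idea concrete, here is a rough C model of the two store
schedules. It is only a sketch, not the actual assembly loop: the names
(store_immediate, store_delayed, lo, adj, top) are hypothetical, each
iteration is assumed to produce one quotient limb lo[i] plus a small
adjustment adj[i] to the limb written one iteration earlier, and the
adjustment is assumed never to carry beyond that limb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: q has n+1 limbs, produced from the top down.
   Iteration i yields lo[i] (limb i) and adj[i] (added to limb i+1). */

/* Current schedule: store the new limb, then read-modify-write the
   previous one in memory (the role played by the adc to memory). */
void store_immediate(uint64_t *q, uint64_t top, const uint64_t *lo,
                     const uint64_t *adj, size_t n)
{
  assert(q != NULL);
  q[n] = top;
  for (size_t i = n; i-- > 0; ) {
    q[i] = lo[i];
    q[i + 1] += adj[i];
  }
}

/* Delayed schedule: keep the previous limb in a local ("register"),
   fold the adjustment in, and store each limb exactly once. */
void store_delayed(uint64_t *q, uint64_t top, const uint64_t *lo,
                   const uint64_t *adj, size_t n)
{
  assert(q != NULL);
  uint64_t pending = top;            /* the extra register */
  for (size_t i = n; i-- > 0; ) {
    q[i + 1] = pending + adj[i];     /* single store per iteration */
    pending = lo[i];
  }
  q[0] = pending;                    /* flush the last limb */
}
```

Both functions leave the same limbs in q; the delayed one trades the
memory read-modify-write for one more live value and a final store
after the loop.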

> I suspect one or two of the register-to-register copy insns could be
> optimised out.

Maybe. And it would be easier to avoid moves if one unrolls the loop
twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
bloated, of course.
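
A generic illustration of the role-switching trick, in C rather than
assembly. This is a toy two-term recurrence, not the division loop: the
names (state_t, step, run_rolled, run_unrolled) and the update formula
are made up for the example. The rolled loop needs a copy per iteration
to shift the state; the version unrolled twice writes each new value
over the oldest one, so the two variables swap roles and the copy
disappears.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t u1, u0; } state_t;  /* u1 older, u0 newer */

/* Toy update; arithmetic wraps modulo 2^64. */
static uint64_t step(uint64_t older, uint64_t newer, uint64_t x)
{
  return older * 3 + newer + x;
}

/* Rolled loop: one register-to-register copy per iteration. */
state_t run_rolled(state_t s, const uint64_t *x, size_t n)
{
  assert(x != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    uint64_t t = step(s.u1, s.u0, x[i]);
    s.u1 = s.u0;                       /* the copy we want to avoid */
    s.u0 = t;
  }
  return s;
}

/* Unrolled twice: each half overwrites the oldest value in place. */
state_t run_unrolled(state_t s, const uint64_t *x, size_t n)
{
  assert(x != NULL || n == 0);
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    s.u1 = step(s.u1, s.u0, x[i]);      /* u1 is now the newer value */
    s.u0 = step(s.u0, s.u1, x[i + 1]);  /* roles are back to normal */
  }
  if (i < n) {                          /* odd count: one rolled step */
    uint64_t t = step(s.u1, s.u0, x[i]);
    s.u1 = s.u0;
    s.u0 = t;
  }
  return s;
}
```

The cost is the one mentioned above: the loop body is duplicated, and an
odd iteration count needs separate handling.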

> In order to run this through the loopmixer, you need to setup data in
> the prologue which makes the adjustment branch to never be taken.
> Letting the inverse be 0 or else B-1 might work...

I vaguely recall some previous attempt at loopmixing this, but I don't
remember any success.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

