div_qr_1 interface
Niels Möller
nisse at lysator.liu.se
Mon Oct 21 06:37:03 CEST 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> Have you analysed the register needs? Pushing all callee-saves
> registers is quite expensive.
Per the FIXME comment, we could avoid saving them for the n == 2 case
(which I think corresponds to n == 3 for the mpn_div_qr_1 caller, so it
might help that regression), but we do need a lot of registers for the
actual loop.
> For the mul insn, it is usually better to copy the invariant/noncritical
> operand to rax, and use the critical operand explicitly in the mul insn.
Will try that. I think one could also try to delay the quotient store
by one iteration, keeping "Q1" in a register until the next iteration.
Then one gets rid of the
  adc Q2,8(QP, UN, 8)
in the loop, leaving only a single store per iteration in the likely
case. It may need yet another register, though.
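A rough C sketch of the delayed-store idea, not the actual assembly loop. It assumes each iteration yields a quotient limb plus a small carry destined for the next-higher limb, and that adding the carry never overflows that limb. The names step_t, store_immediate and store_delayed are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-iteration output: a quotient limb, plus a carry that
   belongs to the next-higher quotient limb. */
typedef struct { uint64_t limb; uint64_t carry; } step_t;

/* Store each limb at once, then fix up the previous limb in memory
   (the read-modify-write that the adc to memory performs). */
static void store_immediate(uint64_t *qp, const step_t *s, size_t n)
{
  assert(n > 0);
  for (size_t i = n; i-- > 0; ) {
    qp[i] = s[i].limb;
    if (i + 1 < n)
      qp[i + 1] += s[i].carry;   /* assumed not to overflow */
  }
}

/* Keep the previous limb in a register ("pending"), fold the carry into
   it there, and store it one iteration late: one plain store per limb. */
static void store_delayed(uint64_t *qp, const step_t *s, size_t n)
{
  assert(n > 0);
  uint64_t pending = s[n - 1].limb;
  for (size_t i = n - 1; i-- > 0; ) {
    pending += s[i].carry;       /* fix-up happens in the register */
    qp[i + 1] = pending;
    pending = s[i].limb;
  }
  qp[0] = pending;               /* flush the last limb after the loop */
}
```

Both functions ignore the carry of the topmost step, and the delayed variant pays for the saved memory update with one extra live value, which matches the concern about register pressure.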
> I suspect one or two of the register-to-register copy insns could be
> optimised out.
Maybe. And it would be easier to avoid moves if one unrolls the loop
twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
bloated, of course.
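To illustrate the role-switching trick in isolation, here is a small C sketch. It uses a made-up two-value recurrence, not the div_qr_1 loop itself; step, state_t, run_rolled and run_unrolled are all hypothetical names. The rolled loop needs a copy per iteration to shift the state, while the twice-unrolled loop overwrites the dead value in place and lets the two variables alternate roles.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t older, newer; } state_t;

/* Arbitrary stand-in for the loop body; arithmetic wraps mod 2^64. */
static uint64_t step(uint64_t older, uint64_t newer, uint64_t x)
{
  return older * 3 + newer + x;
}

/* One step per iteration: needs a register-to-register copy each time. */
static state_t run_rolled(state_t s, const uint64_t *x, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    uint64_t t = step(s.older, s.newer, x[i]);
    s.older = s.newer;           /* the copy we would like to avoid */
    s.newer = t;
  }
  return s;
}

/* Two steps per iteration: a and b swap roles, so no copies are needed. */
static state_t run_unrolled(state_t s, const uint64_t *x, size_t n)
{
  uint64_t a = s.older, b = s.newer;
  size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    a = step(a, b, x[i]);        /* now a is newer, b is older */
    b = step(b, a, x[i + 1]);    /* roles restored: a older, b newer */
  }
  if (i < n) {                   /* odd count: one leftover step */
    a = step(a, b, x[i]);
    return (state_t){ b, a };
  }
  return (state_t){ a, b };
}
```

The cost is visible even here: the body appears twice, and an odd trip count needs its own tail.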
> In order to run this through the loopmixer, you need to setup data in
> the prologue which makes the adjustment branch to never be taken.
> Letting the inverse be 0 or else B-1 might work...
I vaguely recall some previous attempt at loopmixing this, but I don't
remember any success.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.