div_qr_1n_pi1
Torbjörn Granlund
tg at gmplib.org
Fri Jul 9 12:17:42 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
> Same as in the current (from 2013) version. Delaying the write is a bit
> tricky, since we already use all registers. But it would be better to
> update the quotient limbs in memory only in the unlikely
> carry-propagation case. I figure adc to memory is no worse than explicit
> load, adc, store (or adc from memory, store)?
Which is worse depends on CPU and magic.
I did not realise that the register pressure was so bad. Perhaps trying
to decrease that would be most helpful. Sometimes, when values tend to
naturally migrate, some unrolling can reduce register pressure.
When I struggle with register pressure, I usually annotate the code with
what regs are live and where each dies. That can then steer
pressure-reducing transformations.
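
The annotation can also be done mechanically for a straight-line block. What follows is only a rough sketch in C, not anything from GMP: the names insn_t and annotate_liveness are made up for illustration, registers are modelled as bits in a mask, and the use/def sets are assumed to be filled in by hand from the assembly. It scans backwards and records which registers are live before each instruction, plus the peak count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: bit k set means register number k. */
typedef struct {
    uint32_t use;   /* registers read by the instruction */
    uint32_t def;   /* registers written by the instruction */
} insn_t;

static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Backward scan over straight-line code (no branches handled).
   live_before[i] receives the registers live on entry to insn i.
   Returns the largest number of simultaneously live registers. */
static int annotate_liveness(const insn_t *code, size_t n,
                             uint32_t live_at_exit, uint32_t *live_before)
{
    assert(n == 0 || (code != NULL && live_before != NULL));

    uint32_t live = live_at_exit;
    int max_live = popcount32(live);

    for (size_t i = n; i-- > 0; ) {
        /* A written register is dead above the write unless also read here. */
        live = (live & ~code[i].def) | code[i].use;
        live_before[i] = live;
        if (popcount32(live) > max_live)
            max_live = popcount32(live);
    }
    return max_live;
}
```

A register dies at insn i when it is in live_before[i] but not in the live set after insn i. Those are the points where a value could be moved or recomputed to free a register.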
If requiring mulx helps, I would for now forget about mul. All relevant
CPUs have mulx.
> I could try moving the dec. I often try to insert independent
> instructions between depending ones, but perhaps that's bad in this case
> (and generally not very helpful on processors with powerful out-of-order
> capabilities).
Insn fusion happens only with branches (and perhaps cmov) and, iirc, only
if the insns are adjacent.
> Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's
> more like 11.0 c/l vs 11.5.
You could explore this by testing some plain loop, e.g.
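        .text
        .globl  main
main:   mov     $ASSUMED_FREQUENCY, %rax   /* iteration count = assumed clock in Hz */
1:      dec     %rdx                       /* 10 dependent decs per iteration */
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rdx
        dec     %rax                       /* loop counter */
        jnz     1b
        ret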
If ASSUMED_FREQUENCY is right, it should take 10 seconds. Else, I leave
it as an exercise to compute the actual frequency.
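
For reference, the arithmetic could be sketched as below. This is a guess at the intended computation, not something stated above: it assumes the dependent dec chain runs at one dec per cycle, so the loop costs about 10 * ASSUMED_FREQUENCY cycles in total. The function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: 10 dependent decs per iteration cost about 10 cycles, so the
   whole loop is roughly 10 * assumed_hz cycles. Dividing by the measured
   wall-clock time gives the clock the loop actually ran at. */
static double actual_frequency_hz(double assumed_hz, double elapsed_seconds)
{
    assert(assumed_hz > 0 && elapsed_seconds > 0);
    return 10.0 * assumed_hz / elapsed_seconds;
}

/* Rescale a cycles/limb figure that was derived from wall-clock time using
   assumed_hz, so that it reflects actual_hz instead. */
static double corrected_cycles_per_limb(double measured_cl,
                                        double assumed_hz, double actual_hz)
{
    assert(assumed_hz > 0 && actual_hz > 0);
    return measured_cl * actual_hz / assumed_hz;
}
```

For example, if the loop is built with an assumed 2.0e9 Hz and finishes in 8 seconds instead of 10, actual_frequency_hz(2.0e9, 8.0) gives 2.5e9. Any c/l number computed under the 2.0e9 assumption would then need scaling by 2.5e9 / 2.0e9 = 1.25.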
--
Torbjörn
Please encrypt, key id 0xC8601622