div_qr_1n_pi1

Fri Jul 9 12:17:42 UTC 2021

nisse at lysator.liu.se (Niels Möller) writes:

  Same as in the current (from 2013) version. Delaying the write is a bit
  tricky, since we already use all registers. But it would be better to
  update the quotient limbs in memory only in the unlikely
  carry-propagation case. I figure adc to memory is no worse than explicit
  load, adc, store (or adc from memory, store)?

Which is worse depends on CPU and magic.

I did not realise that the register pressure was so bad.  Perhaps trying
to decrease that would be most helpful.  Sometimes, when values tend to
naturally migrate, some unrolling can reduce register pressure.

When I struggle with register pressure, I usually annotate the code with
which regs are live and where each dies.  That can then steer
pressure-reducing transformations.

If requiring mulx helps, I would for now forget about mul.  All relevant
CPUs have mulx.

  I could try moving the dec. I often try to insert independent
  instructions between depending ones, but perhaps that's bad in this case
  (and generally not very helpful on processors with powerful out-of-order
  capabilities).

Insn fusion happens only with branches (and perhaps cmov), and iirc only
if the two insns are adjacent.

  Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's
  more like 11.0 c/l vs 11.5.

You could explore this by testing some plain loop, e.g.

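	.text
	.globl main
	# ASSUMED_FREQUENCY (clock in Hz) must be defined before assembling.
main:	mov $ASSUMED_FREQUENCY, %rax	# loop count
1:	dec %rdx			# chain of ten dependent decs
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rdx
	dec %rax			# loop counter, independent of the chain
	jnz 1b
	ret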

If ASSUMED_FREQUENCY is right, it should take 10 seconds, since the ten
dependent decs make each iteration take 10 cycles.  Else, I leave it as
an exercise to compute the actual frequency.
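
For reference, the arithmetic behind that exercise could be sketched in C
as below.  This assumes each dec has a latency of one cycle, so that the
chain of ten makes every iteration take 10 cycles, and that the loop
counter dec and the jnz run in parallel with the chain.  The name
estimate_hz and its parameters are made up for illustration; the elapsed
time would have to come from timing a run of the test program.

```c
#include <assert.h>

/* Hypothetical helper, not part of GMP.
   The test loop runs assumed_hz iterations of (assumed) 10 cycles each,
   so it executes 10 * assumed_hz cycles in total.  Dividing that by the
   measured wall-clock time gives the frequency the core actually ran at. */
double
estimate_hz (double assumed_hz, double elapsed_seconds)
{
  assert (elapsed_seconds > 0.0);
  return 10.0 * assumed_hz / elapsed_seconds;
}
```

As an illustration, with an assumed 2100 MHz and a measured run time of 15
seconds, estimate_hz (2100e6, 15.0) gives 1400 MHz, which is the 1.5
factor from the quoted question.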

-- 
Torbjörn
Please encrypt, key id 0xC8601622