div_qr_1n_pi1

Thu Jul 8 21:26:37 UTC 2021

I think you should delay writing through QP to avoid adc to a memory
location, and have just one plain write through QP per iteration.

The dec UN and the branch might run faster if put adjacent to each
other, as many CPUs fuse these into a single instruction.

Your cycle numbers should probably be multiplied by a factor

  ("turbo" frequency) / (nominal frequency)

as 7.x c/l seems faster than we ever measured.
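
To make the correction concrete, here is a minimal sketch of the
rescaling. It assumes the measurement counted ticks at the nominal
frequency while the core actually ran at the turbo frequency. The
function name and the frequencies and c/l figure below are hypothetical
illustration values, not measurements.

```python
def scale_cycles(measured_cl, turbo_hz, nominal_hz):
    """Rescale a cycles/limb figure by (turbo frequency) / (nominal frequency).

    Assumption: measured_cl was counted in nominal-frequency ticks
    while the core was clocked at turbo_hz.
    """
    return measured_cl * turbo_hz / nominal_hz


# Hypothetical numbers only: 7.5 c/l measured, 4.0 GHz turbo, 3.0 GHz nominal.
corrected = scale_cycles(7.5, 4.0e9, 3.0e9)
print(corrected)  # 10.0
```

With these made-up inputs the corrected figure is a third higher than
the measured one.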

-- 
Torbjörn
Please encrypt, key id 0xC8601622