Torbjörn Granlund tg at gmplib.org
Thu Jul 8 21:26:37 UTC 2021

I think you should delay writing through QP to avoid adc to a memory
location, and have just one plain write through QP per iteration.

The dec UN and the branch might run faster if put adjacent to each
other, as many CPUs fuse such a pair into a single micro-op.

Your cycle numbers should probably be multiplied by a factor

  ("turbo" frequency) / (nominal frequency)

as 7.x c/l seems faster than we ever measured.
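
As a rough sketch of that correction, assuming the timing was counted in
nominal-frequency cycles while the core actually ran at its turbo
frequency, the scaling could be written as below. The function name and
parameters are hypothetical and are not part of GMP or its timing tools.

```c
#include <assert.h>

/* Hypothetical helper, not a GMP function: rescale a cycles/limb figure
   that was counted at the nominal frequency into actual core cycles,
   assuming the core ran at the turbo frequency for the whole run.  */
static double
scale_cycles_per_limb (double measured_cl, double turbo_hz, double nominal_hz)
{
  /* Both frequencies must be positive for the ratio to make sense.  */
  assert (turbo_hz > 0.0 && nominal_hz > 0.0);

  /* corrected = measured * (turbo frequency) / (nominal frequency)  */
  return measured_cl * (turbo_hz / nominal_hz);
}
```

With made-up values of 7.5 c/l measured, 3.0 GHz nominal and 4.5 GHz
turbo, this gives 11.25 c/l. Those numbers are for illustration only and
are not measurements.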

