div_qr_1n_pi1

Fri Jul 9 07:24:38 UTC 2021

Torbjörn Granlund <tg at gmplib.org> writes:

> I think you should delay writing through QP to avoid adc to a memory
> place, and have just one plain write through QP per iteration.
>
> The dec UN and the branch might run faster if put adjacent to each
> other, as many CPUs fuse these into a single instruction.

Same as in the current (from 2013) version. Delaying the write is a bit
tricky, since we already use all registers. But it would be better to
update the quotient limbs in memory only in the unlikely
carry-propagation case. I figure adc to memory is no worse than explicit
load, adc, store (or adc from memory, store)?
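
To make the trade-off concrete, here is a rough C sketch of the two storing strategies. It is an illustration only, not the actual assembly loop or the mpn code: limb_t, incr, store_adc_mem and store_delayed are names made up for this example, and the loop is simplified so that iteration i yields a limb q[i] plus a carry cy[i] (0 or 1) destined for the next more significant limb. The carry of the top iteration is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* Add 1 to the n-limb number at p, least significant limb first.
   A carry out of the top limb is dropped. */
static void incr(limb_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (++p[i] != 0)
            break;
}

/* Variant A: plain store of the new limb, then add the carry
   directly into the previously stored limb in memory. */
void store_adc_mem(limb_t *qp, const limb_t *q,
                   const unsigned char *cy, size_t n)
{
    assert(n > 0);
    qp[n - 1] = q[n - 1];
    for (size_t i = n - 1; i-- > 0;) {
        qp[i] = q[i];
        qp[i + 1] += cy[i];              /* read-modify-write in memory */
        if (cy[i] && qp[i + 1] == 0)     /* unlikely: the limb wrapped */
            incr(qp + i + 2, n - i - 2);
    }
}

/* Variant B: keep the newest limb in a local ("register") until the
   next iteration has delivered its carry, then do one plain store.
   Already stored limbs are touched only when the carry propagates. */
void store_delayed(limb_t *qp, const limb_t *q,
                   const unsigned char *cy, size_t n)
{
    assert(n > 0);
    limb_t pending = q[n - 1];
    for (size_t i = n - 1; i-- > 0;) {
        pending += cy[i];
        if (cy[i] && pending == 0)       /* unlikely: the limb wrapped */
            incr(qp + i + 2, n - i - 2);
        qp[i + 1] = pending;             /* the single write per iteration */
        pending = q[i];
    }
    qp[0] = pending;
}
```

Both variants produce the same limbs. The delayed one needs an extra live value, which is the register-pressure problem mentioned above.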

I could try moving the dec. I often try to insert independent
instructions between dependent ones, but perhaps that's bad in this case
(and generally not very helpful on processors with powerful out-of-order
capabilities).

> Your cycle numbers should probably be multiplied by a factor
>
>   ("turbo" frequency) / (nominal frequency)
>
> as 7.x c/l seems faster than we ever measured.

lscpu says

Model name:                      AMD Ryzen 5 PRO 4650U with Radeon Graphics
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         1397.125
CPU max MHz:                     2100.0000
CPU min MHz:                     1400.0000

Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's
more like 11.0 c/l vs 11.5.
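
The correction itself is a single multiplication. A minimal sketch in C, assuming the timer counts at the nominal frequency while the core actually runs at the boost frequency; scale_cycles is a name made up for this example, and whether the two lscpu fields above are the right inputs is exactly the open question:

```c
#include <assert.h>

/* Scale a measured cycles/limb figure by (boost MHz) / (nominal MHz).
   Assumption: cycles were counted at nominal_mhz while the core ran
   at boost_mhz, so the true cycle count is higher by that ratio. */
double scale_cycles(double measured_cl, double boost_mhz, double nominal_mhz)
{
    assert(nominal_mhz > 0.0);
    return measured_cl * (boost_mhz / nominal_mhz);
}
```

For instance, scale_cycles(7.0, 2100.0, 1400.0) gives 10.5, since the ratio is 1.5.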

At least benchmark numbers are a lot more consistent between runs on
this machine than they were on my previous laptop.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.