div_qr_1n_pi1
Niels Möller
nisse at lysator.liu.se
Fri Jul 9 07:24:38 UTC 2021
Torbjörn Granlund <tg at gmplib.org> writes:
> I think you should delay writing through QP to avoid adc to a memory
> place, and have just one plain write through QP per iteration.
>
> The dec UN and the branch might run faster if put adjacent to each
> other, as many CPUs fuse these into a single instruction.
Same as in the current (from 2013) version. Delaying the write is a bit
tricky, since we already use all registers. But it would be better to
update the quotient limbs in memory only in the unlikely
carry-propagation case. I figure adc to memory is no worse than explicit
load, adc, store (or adc from memory, store)?
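
To make the deferred-write idea concrete, here is a minimal C sketch. It is not the assembly under discussion, only a model of the bookkeeping: the function name, the raw[] and cy[] inputs, and the 64-bit limb type are all assumptions made for illustration. Each quotient limb is held in a local variable (standing in for a register) until the next iteration's carry is known, so there is one plain store per iteration, and limbs already in memory are touched only when the held limb wraps around.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: iteration i (running from the most significant limb
   down to limb 0) yields a raw quotient limb raw[i] and a carry bit cy[i]
   (0 or 1) that belongs to limb i+1. cy[n-1] is assumed to be 0. */
void store_quotient_deferred(uint64_t *qp, const uint64_t *raw,
                             const unsigned char *cy, size_t n)
{
    if (n == 0)
        return;

    uint64_t pending = raw[n - 1];      /* held back, not yet stored */

    for (size_t i = n - 1; i-- > 0; ) {
        pending += cy[i];               /* register-only add in the common case */
        if (pending < cy[i]) {
            /* Unlikely case: the held limb wrapped to zero, so the carry
               has to ripple into limbs that are already in memory. */
            for (size_t j = i + 2; j < n; j++)
                if (++qp[j] != 0)
                    break;
        }
        qp[i + 1] = pending;            /* the single plain store */
        pending = raw[i];
    }
    qp[0] = pending;
}
```

The price is one extra live value (pending), which is exactly the register pressure problem mentioned above.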
I could try moving the dec. I often try to insert independent
instructions between dependent ones, but perhaps that's bad in this case
(and generally not very helpful on processors with powerful out-of-order
capabilities).
> Your cycle numbers should probably be multiplied by a factor
>
> ("turbo" frequency) / (nominal frequency)
>
> as 7.x c/l seems faster than we ever measured.
lscpu says

  Model name:       AMD Ryzen 5 PRO 4650U with Radeon Graphics
  Stepping:         1
  Frequency boost:  enabled
  CPU MHz:          1397.125
  CPU max MHz:      2100.0000
  CPU min MHz:      1400.0000

Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's
more like 11.0 c/l vs 11.5.
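
The correction can be written out as a small C helper. This is only a sketch of the arithmetic, assuming the benchmark counted cycles at the nominal clock while the core actually ran at the turbo clock; the function name and parameters are made up for illustration.

```c
/* Scale a cycles/limb figure that was computed against the nominal clock
   so that it reflects cycles at the (assumed) actual turbo clock. */
double scale_cycles_per_limb(double measured_cl, double turbo_mhz,
                             double nominal_mhz)
{
    return measured_cl * (turbo_mhz / nominal_mhz);
}
```

With the lscpu figures above, the factor is 2100 / 1400 = 1.5, so a hypothetical reading of 7.0 c/l would correspond to 10.5 c/l.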
At least benchmark numbers are a lot more consistent between runs on
this machine than they were on my previous laptop.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.