div_qr_1n_pi1
Torbjörn Granlund
tg at gmplib.org
Sun Jun 6 20:16:27 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
> Maybe we should have some macrology for that? Or do all relevant
> processors and compilers support efficient cmov these days? I'm sticking
> to masking expressions for now.
Let's not trust results from compiler generated code for these things.
The mixture of inline asm and plain code is hard for compilers to deal
with. Very subtle things can make a huge cycle count difference.
For conclusive results, asm is needed, unfortunately.
(That's not always the case; Marco and I have played with AVX3/AVX512
lately with both asm and C using intrinsics. C behaved well there, but
that was for non-arithmetic loops.)
So what about cmov's performance? Intel fixed its latency for their
high-end cores with Broadwell, which was about 6 years ago. Their
low-power cores still have 2 cycles of latency though. AMD's cores have
always had 1 cycle latency and good throughput.
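
For readers following along, here is a minimal C sketch of the two forms being compared: a masking expression and a plain conditional that a compiler may or may not turn into cmov. The function names are hypothetical and are not taken from GMP; the conditional subtraction is only an illustration of the kind of adjustment step a division routine performs. As noted above, what the compiler actually emits for either form can vary, so this shows the source-level idea and nothing about cycle counts.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; these are not GMP functions. */

/* Return a if cond is 1, b if cond is 0, using a mask instead of a branch.
   cond must be exactly 0 or 1. */
static uint64_t select_mask(uint64_t cond, uint64_t a, uint64_t b)
{
    uint64_t mask = -cond;            /* 0 -> all zeros, 1 -> all ones */
    return (a & mask) | (b & ~mask);
}

/* The same selection written as a conditional expression. A compiler may
   emit cmov here, or a branch; the C source does not decide that. */
static uint64_t select_ternary(uint64_t cond, uint64_t a, uint64_t b)
{
    return cond ? a : b;
}

/* Masked conditional subtraction: r - d if r >= d, otherwise r. */
static uint64_t cond_sub_mask(uint64_t r, uint64_t d)
{
    uint64_t mask = -(uint64_t)(r >= d);
    return r - (d & mask);
}
```

Both select functions compute the same value for cond in {0, 1}; the difference is only in what code generation they invite, which is why measuring them needs asm rather than compiler output.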
> Worries about side-channel leakage of cmov aren't so relevant for these
> particular functions, since the use of MPN_INCR_U is a data-dependent
> loop anyway.
Granted.
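
To make the point about the data-dependent loop concrete, here is a rough C sketch of a carry-propagating increment. It is a hypothetical stand-in written for this illustration, not GMP's actual MPN_INCR_U macro, and it assumes 64-bit limbs stored least significant first. The return value (number of limbs touched) is added only to make the data dependence visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, not GMP's MPN_INCR_U.
   Adds incr to the n-limb number at p (least significant limb first) and
   returns how many limbs were written. The loop stops as soon as a limb
   does not wrap around, so the iteration count depends on the operand
   values. */
static size_t incr_u_sketch(uint64_t *p, size_t n, uint64_t incr)
{
    size_t i = 0;
    while (i < n) {
        uint64_t old = p[i];
        p[i] = old + incr;
        i++;
        if (p[i - 1] >= old)   /* no wraparound, so no carry out */
            break;
        incr = 1;              /* propagate a carry of 1 to the next limb */
    }
    return i;
}
```

With limbs {1, 2, 3} and incr = 1 only one limb is touched, while {UINT64_MAX, UINT64_MAX, 5} touches all three. The running time therefore reveals something about the data regardless of how the surrounding selects are implemented.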
--
Torbjörn
Please encrypt, key id 0xC8601622