div_qr_1n_pi1

Sun Jun 6 20:16:27 UTC 2021

nisse at lysator.liu.se (Niels Möller) writes:

  Maybe we should have some macrology for that? Or do all relevant
  processors and compilers support efficient cmov these days? I'm sticking
  to masking expressions for now.

Let's not trust results from compiler-generated code for these things.
The mixture of inline asm and plain code is hard for compilers to deal
with.  Very subtle things can make a huge difference in cycle count.
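
To make the comparison concrete, here is a minimal sketch of the two
C-level formulations under discussion.  The names limb_t, select_mask
and select_ternary are made up for illustration and are not GMP code.
Whether the second form turns into cmov or a branch is up to the
compiler, which is exactly the uncertainty described above.

```c
#include <stdint.h>

/* Hypothetical stand-in for GMP's mp_limb_t; assumed 64 bits here. */
typedef uint64_t limb_t;

/* Masking expression: cond must be exactly 0 or 1.
   -cond is all ones when cond is 1 and all zeros when cond is 0,
   so the result is a when cond is 1 and b when cond is 0. */
static inline limb_t
select_mask (limb_t cond, limb_t a, limb_t b)
{
  limb_t mask = -cond;
  return (a & mask) | (b & ~mask);
}

/* Plain conditional: the compiler may emit cmov, a branch, or
   something else, depending on target, flags and context. */
static inline limb_t
select_ternary (limb_t cond, limb_t a, limb_t b)
{
  return cond ? a : b;
}
```

Both functions compute the same value; only the generated code, and
hence the cycle count, may differ.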

For conclusive results, asm is needed, unfortunately.

(That's not always the case; Marco and I have played with AVX3/AVX512
lately with both asm and C using intrinsics.  C behaved well there.  But
that was for non-arithmetic loops.)

So what about cmov's performance?  Intel fixed its latency on their
high-end cores with Broadwell, which was about 6 years ago.  Their
low-power cores still have 2-cycle latency, though.  AMD's cores have
always had 1-cycle latency and good throughput.

  Worries about side-channel leakage of cmov aren't so relevant for these
  particular functions, since the use of MPN_INCR_U is a data dependent
  loop anyway.

Granted.
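
For readers who have not looked at it, the kind of loop meant here can
be sketched roughly as below.  This is a simplified stand-in written
for illustration, not the actual MPN_INCR_U macro; the names limb_t and
incr_u, and the returned limb count, are assumptions of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GMP's mp_limb_t; assumed 64 bits here. */
typedef uint64_t limb_t;

/* Add incr to the multi-limb number starting at p[0], propagating
   carry upwards.  The caller must guarantee that the carry stops
   before the end of the operand.  Returns the number of limbs
   touched, purely to make the data dependence visible. */
static size_t
incr_u (limb_t *p, limb_t incr)
{
  size_t i = 0;
  limb_t x = p[0] + incr;
  p[0] = x;
  if (x < incr)                 /* carry out of the low limb */
    {
      do
        i++;
      while (++p[i] == 0);      /* keep going while limbs wrap to 0 */
    }
  return i + 1;
}
```

The iteration count depends on how many consecutive limbs are all
ones, so the running time reveals something about the operand no
matter how the conditional select is implemented.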

-- 
Torbjörn
Please encrypt, key id 0xC8601622