bodrato at mail.dm.unipi.it
Thu Oct 3 19:10:43 UTC 2019
Il Gio, 3 Ottobre 2019 8:39 am, Niels Möller ha scritto:
> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
>> I implemented it with two variants of that portion:
>> #ifdef BRANCHLESS
>> int cond1 = (r0 >= m0);
>> int cond2 = (r0 > (m0 >> 1));
>> r0 -= m0 & - (mp_limb_t) cond1;
>> r0 <<= 1;
>> r0 -= m0 & - (mp_limb_t) (cond1 ^ cond2);
>> if (r0 <= (m0 >> 1))
>> r0 <<= 1;
>> else if (r0 >= m0)
>> r0 = (r0 - m0) << 1;
>> r0 = (r0 << 1) - m0;
> What's the range of r0? To me it looks like the last two cases could
> overflow if r0 is much larger than m0, so I take it there's some limit?
r0 comes from sqr+redc.
So, if I understood correctly, it has at most as many bits as m0, but may
be r0 >= m0. Since the number of bits is not larger, we can not have r0 >=
2 * m0, and the formula (r0 - m0) << 1 can not overflow.
> If r0 <= m0, one could the modular-add-via-sub trick, which reduces the
> number of conditions from two to one:
Ok, I agree, if cond1 = (r0 >= m0) is always false, then we can get rid of
Anyway, I was simply surprised by the fact that a code with two branches
seems faster than the same idea reimplemented in a branchless-way, with
masks. Maybe because the compiler translate the branchy code in assembler
using cmovs, and obtain a more efficient branchless object code than the
code with masks? I will check...
> But if m0 has the most significant bit set, one would need some other
> trick. If you don't find something clever enough, the easiest way to
> organize it may be to work with canonical representation, and
For the canonical representation we need a canonical modular reduction.
I'm not sure it could be faster than the current code, at least for large
enough exponents (let's say 50-60 bits).
More information about the gmp-devel