Small-base modexp

Thu Oct 3 19:10:43 UTC 2019

Ciao,

Il Gio, 3 Ottobre 2019 8:39 am, Niels Möller ha scritto:
> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:

>> I implemented it with two variants of that portion:
>>
>> #ifdef BRANCHLESS
>>               int cond1 = (r0 >= m0);
>>               int cond2 = (r0 > (m0 >> 1));
>>
>>               r0 -= m0 & - (mp_limb_t) cond1;
>>               r0 <<= 1;
>>               r0 -= m0 & - (mp_limb_t) (cond1 ^ cond2);
>> #else
>>               if (r0 <= (m0 >> 1))
>>                 r0 <<= 1;
>>               else if (r0 >= m0)
>>                 r0 = (r0 - m0) << 1;
>>               else
>>                 r0 = (r0 << 1) - m0;
>> #endif
>
> What's the range of r0? To me it looks like the last two cases could
> overflow if r0 is much larger than m0, so I take it there's some limit?

r0 comes from sqr+redc.
So, if I understood correctly, it has at most as many bits as m0, but may
be r0 >= m0. Since the number of bits is not larger, we can not have r0 >=
2 * m0, and the formula (r0 - m0) << 1 can not overflow.

> If r0 <= m0, one could the modular-add-via-sub trick, which reduces the
> number of conditions from two to one:

Ok, I agree, if cond1 = (r0 >= m0) is always false, then we can get rid of
it :-D

Anyway, I was simply surprised by the fact that a code with two branches
seems faster than the same idea reimplemented in a branchless-way, with
masks. Maybe because the compiler translate the branchy code in assembler
using cmovs, and obtain a more efficient branchless object code than the
code with masks? I will check...

> But if m0 has the most significant bit set, one would need some other
> trick. If you don't find something clever enough, the easiest way to
> organize it may be to work with canonical representation, and

For the canonical representation we need a canonical modular reduction.
I'm not sure it could be faster than the current code, at least for large
enough exponents (let's say 50-60 bits).

Ĝis,
m

-- 
http://bodrato.it/papers/