Small-base modexp

Marco Bodrato bodrato at mail.dm.unipi.it
Wed Oct 2 23:40:46 UTC 2019

```Ciao,

Il Mer, 2 Ottobre 2019 3:29 pm, Torbjörn Granlund ha scritto:
> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:

>   I wrote the doubling-modulo step with 3 possible operations:
>    if it can not overflow, simply lshift;
>    otherwise choose between (x*2-m) and ((x-m)*2)
>
> Do you choose that at each step?  That might be a cause for delay due to
> branch miss.

Yes, at each step.

I implemented it with two variants of that portion:

#ifdef BRANCHLESS
int cond1 = (r0 >= m0);
int cond2 = (r0 > (m0 >> 1));

r0 -= m0 & - (mp_limb_t) cond1;
r0 <<= 1;
r0 -= m0 & - (mp_limb_t) (cond1 ^ cond2);
#else
if (r0 <= (m0 >> 1))
r0 <<= 1;
else if (r0 >= m0)
r0 = (r0 - m0) << 1;
else
r0 = (r0 << 1) - m0;
#endif

I did not measure them very carefully, but the BRANCHLESS variant seems
slower.

> Let me make sure I understand what you're comparing.
>
> Are you comparing mpz_powm with base = 2 with your mew special code for
> base = 2 for size = 1?

The current mpz_powm code has (basically) the same speed for any base. 2,
3, 0xffffffffff or whatever.

I added a special function, that is called by mpz_powm when the base is 2.
When size=1 this function uses an ad-hoc loop directly operating on limbs,
for the other sizes the special function uses functions from the mpn_
layer.

>   overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPU
>   freq 3500.08 MHz
>                mpz_powm    mpz_powm.2    mpz_powm.3
>   1         0.000004322       #0.8082        0.9884
>   2         0.000007201        1.0723       #0.9836
>   3         0.000013274       #0.9465        0.9928

>   56        0.001738323       #0.7941        1.0011

The current code, with a small base (eg. 3) is basically as fast as with a
random base (maybe 1% difference, maybe not).

For size=1, my specialised function with base=2 is 20% faster.
For small sizes (eg.3), my specialised function (base=2) is slightly
faster (5%).
For larger sizes (eg.56), my specialised function (base=2) is sensibly
faster (20%).

For size=2, my specialised function (base=2) is slower, at least on some
architectures.

> I'd expect to see significant speedup for slower functions like gcd,
> powm and jacobi from specialisation.

powm may have different specialisations.

With respect to size (if we write a specialised powm function for a
generic base and a modulus of size 2 limbs) then my function may loose
even harder for that size...

> than is possible to get back.  For for slower functions it is worth it.

With the new composite test, a specialisation for base=2 makes sense...

Ĝis,
m

--
http://bodrato.it/papers/

```