# Cancellation with hgcd / unbalanced mulmod_bnm1

bodrato at mail.dm.unipi.it bodrato at mail.dm.unipi.it
Sat Nov 12 11:09:32 CET 2011

```Ciao Niels!

Il Ven, 11 Novembre 2011 4:21 pm, Niels ha scritto:
> The input sizes to mpn_mulmod_bnm1 is then 3n x n. How well suited is
> current mpn_mulmod_bnm1 to these unbalanced operands?

The cost of current mpn_mulmod_bnm1 implementation does not really depend
on how much balanced the operands are.
It computes two products: mod m'=B^{k/2}+1, and mod m"=B^{k/2}-1. In your
scenario, it will further reduce a, resulting in two 3/2 n x n mulmods; it
will detect that b is short and will not reduce it, this is the only
saving related to unbalancement.

> Another way is to split a as a = a_3 B^{3n} + a_2 B^{2n} + a_1 B^n +
[...]
>   (a_1 B^n + a_0) * b  // mpn_mul, 2n x n
> + B^2 (a_2 b mod B^n)  // mpn_mullo, n x n

> Another option is to use the modulo m = B^{3n} - B^{2n}, and do the
[...]
>   (a_1 B^n + a_0) * b        // mpn_mul, 2n x n
> + B^2 (a_2 b mod (B^n - 1))  // mpn_mulmod_bnm1, n x n

> Will that be better or worse than an unbalanced 3n x n mpn_mulmod_bnm1?
> In the fft range, it *should* be worse, since the mpn_mul will be done
> as an mpn_mulmod_bnm1 of size 3n. But what about smaller sizes?

Exactly, in the FFT range the cost of the mpn_mul 2n x n and the cost of
the mpn_mulmod_bnm1 3n x n are the same. Near the threshold the difference
is small, i.e. probably smaller than the time required by mullo.

There is a similar choice in invertappr, where you need only 2n limbs from
a 2n x n product. The threshold in this case is INV_MULMOD_BNM1_THRESHOLD.
It varies from 20 to 100 limbs. But I didn't try the mul(nxn)+B*mullo(nxn)
trick.

> Any other tricks?

Directly using FFT modulo m = B^{3n} + 1 should be slower than mulmod_bnm1...

Regards,
m

--
http://bodrato.it/papers/

```