# Wraparound multiplicaton vs mullo

Torbjorn Granlund tg at gmplib.org
Sat Oct 17 21:59:49 CEST 2009

```nisse at lysator.liu.se (Niels Möller) writes:

nisse at lysator.liu.se (Niels "=?iso-8859-1?Q?M=F6ller?=") writes:

> Lets see what this means in Schoolbook, Karatsuba and Toom ranges,
> ignoring the O(n) term:
>
>         Schoolbook  Karatsuba     Toom-3     Toom-4
> exponent         2       1.59       1.46       1.40
>
> x^2 - 1  0.5  M(n)  0.66 M(n)  0.72 M(n)  0.76 M(n)
> x^4 - 1  0.31 M(n)  0.56 M(n)  0.65 M(n)  0.71 M(n)

I've tried the x^2 - 1 algorithm now, and benchmarked on an AMD
x86_32. A first implementation is attached.

It's faster than full multiplication, mpn_mul_n, for n >= 20, and
faster than mpn_mullow_n for n >= 24 (current mpn_mullow_n uses
quadratic algorithms in this range, while for larger sizes it uses the
obvious divide-and-conquer strategy, not Mulders' enhancement).

I've benchmarked up to n = 500, and in the range 40 <= n <= 500, it
consistently saves approximately 25% compared to mpn_mul_n. Since this
is mostly toom3 and toom4 range, it's quite close to the above
analysis.

Exellent!

/* mul_2nm1.c
*
* Experimental code for multiplication mod 2^n-1.
*/

Should this be B^n-1?

/* Inputs and outputs are n limbs, computation is mod B^n - 1, and
* values are seminormalized; zero is represented as either 0 or B^n -
* 1. Needs 2n limbs at rp. */
static void
mul_2nm1_ref (mp_ptr rp, mp_srcptr ap, mp_srcptr bp, mp_size_t n)
{
mp_limb_t cy;
mpn_mul_n (rp, ap, bp, n);
cy = mpn_add_n (rp, rp, rp+n, n);
/* If cy == 1, then the value of rp is at most B^n - 2, so there can
* be no overflow when adding in the carry. */
MPN_INCR_U (rp, n, cy);
}

Do you represent 0 canonically as B^n-1?
(I suppose one could preincrement, then decrement if not carry,

mul_2nm1_2 (mp_ptr rp, mp_srcptr ap, mp_srcptr bp, mp_size_t size,

2nm1 is for 2^n minus 1?  Again, isn't B the correct base?
What is the _2 for?