mpn_mulmod_bnm1

Thu Apr 3 07:36:44 UTC 2014

On 03/04/2014, at 4:53 PM, Niels Möller <nisse at lysator.liu.se> wrote:

>> I propose using a modulus of the form
>> 
>> M = (B^{k u_1} + 1) (B^{k u_2 + 1) ... (B^{k u_s} + 1),
>> 
>> where
>> 
>> B = 2^64
>> k is some integer
>> u_1 > u_2 > ... > u_s is a decreasing sequence of powers of 2.
> 
> Seems closely related to variants of the TFT which use polynomial
> 
>  f(x) = (x^{u_1}+1) (x^{u_2} + 1) ... (x^{u_s} + 1)
> 
> where u_j = 2^{k_j} are distinct powers of two and f(x) is a factor of
> the larger x^{2^{2u_1} - 1. Which is a TFT, since we evaluate the input
> polynomial (or reduced variants thereof) at the roots of f, and the
> roots of f are a subset of the roots of x^{2^n} - 1 (and if needed, one
> can always go between x^{2^k} - 1 and x^{2^k} + 1 by a simple change of
> variable).
> 
> I've seen the details spelled out in some paper on in-place TFT
> transforms.

Yes, I have seen this paper, but I can't remember the title or author.... when I ran into it, I remember thinking that I wish I had gotten around to writing down this idea first :-)

>> REDUCE: takes as input an arbitrary integer, and computes its residues
>> modulo B^{k u_1} + 1, ..., B^{k u_s} + 1. [...] It ends up looking
>> very similar to the top layers of the TFT algorithm, working on the
>> input in chunks of k limbs.
> 
> I'd guess it really *is* a TFT algorithm...
> 
> And the cool thing, compared to the plain TFT, is that the subset of
> roots used really correspond to a useful integer modulo. (*Maybe* that's
> in some way the case also with the plain TFT, but I'm afraid that the
> subset of roots used in my code corresponds to a factorization of
> x^{2^n}-1 which can't be lifted up from Z_p[x] to Z[x]. I haven't
> thought this through carefully).

Right.

It is possible to do everything using the polynomial f(x) you suggest above (in this case the factorisation does lift to Z[x]), and I think this may have been what I tried first.

But it has a few disadvantages:

(1) the bit bounds for the coefficients get worse. For example if u = 5 then f(x) = x^5 + x^4 + x + 1, so when you compute say a(x) b(x) mod f(x) (in Z[x]), every coefficient in the result is a sum of up to *five* terms from a(x) b(x). If say u = 63, you need an extra 6 bits. When you only work with x^n + 1, you never need extra bits like this.

(2) I haven't checked this carefully, but I believe my method is faster even in theory, mainly because some of the work is transferred out of the TFT into plain integer arithmetic. During the evaluation phase, you replace roughly one layer of TFT (and notice this is the "hard" part of the butterfly, because you are always working with the x^n + 1 half rather than the x^n - 1 half) with corresponding mpn_add/subs. The adds/subs operate on integers before they have been split into coefficients, so there's roughly half as much data to process, and you are only doing adds/subs instead of modmuls. The same thing applies during the interpolation phase; you are doing the butterfly work after the recomposition has already happened, so again only half as much data involved.

(3) You probably need more specialised assembly code to handle the unusual low-level TFT primitives. (I haven't thought through this carefully.)

david