fac_ui rewrote.

bodrato at mail.dm.unipi.it bodrato at mail.dm.unipi.it
Wed Jan 18 08:21:12 CET 2012

Ciao,

Il Mar, 17 Gennaio 2012 4:50 pm, Torbjorn Granlund ha scritto:
> Considering that the fac_ui binary is about 4000 bytes, a 1k table is

Consider that it contains the sieve (which should be unified with the
other implementation in the library and removed from here), the swing
factorial (that can be reused by double-factorial, and _dc_bin_ui)... this
means that a 1k table for the factorial only is not that small :-)

>   Another possible strategy would be to add another intermediate
algorithm for the oddfactorial:
>    - we can compute: odd n in ]50..100] times (odd n in ]25..50])^2,
times (odd n in [13..25])^3 ...

> This is a clever idea.  I am not sure how to organise the actual
computation, assuming this is (semi-) basecase code.

The clever idea is not mine. I think it is not really a basecase, it
should be quite efficient also for large computations. I'll try.

> The division in the basecase code is probably worse, in terms of
percentual cost.

I replaced the divisions in the basecase with setting a constant...

>   My tests with binom code used a function:
>   mpz_prodd_1_1 (mpz_ptr x, mp_limb_t l, mp_limb_t h)
...
> I suppose here the mulN trick is applicable.

Absolutely. The odd-mulN trick can be applied to: binomial, factorial,
double-factorial (basecases).
The trickier prime-mulN trick can be used for: primorial, swing-factorial
(i.e. prime-based [double]factorials), prime-based binomial.

>   I'm very interested in your opinion about the mpz_prodlimbs(X, F[], J)
...
> I am not sure I understand this alternative method.  Are you suggesting
that one starts multiplyng limbs, forming some number of TOOM22_THRS sized
chunks, and then start multiplying these chunks recursively?

Exactly.

This also has a non-immediately obvious advantage: forming chunks can be
done in-place, after this phase we can have a better estimate of the final
size, and of the scratch we need.

>   Should we measure a new threshold for this block size?
>   MUL_TOOM22_THRESHOLD is based on balanced products, while here we
>   should compare balanced with mul_1...
>
> Good question.  Using a fudge factor computed from the limb size and the
limb count for when this entire algorithm is used, might help a little.

With the "alternative method" we don't need such a factor.
Currently a fixed "random" one is used:

#define REC_PROD_THRESHOLD (MUL_TOOM22_THRESHOLD|(MUL_TOOM22_THRESHOLD>>1))

i.e. TOOM22 <= PROD <= 1.5 * TOOM22

But this value is probably too small. Splitting a block smaller than
2*TOOM22 is probably useless: we only obtain a balanced basecase product.

--
http://bodrato.it/software/combinatorics.html