fac_ui rewrote.

bodrato at mail.dm.unipi.it bodrato at mail.dm.unipi.it
Mon Jan 16 17:13:35 CET 2012


Il Dom, 15 Gennaio 2012 3:21 pm, Torbjorn Granlund ha scritto:
> I have now read the new factorial code looking for possible further
> improvements.


> For the limb accumulation, the code uses this macro:
> #define FACTOR_LIST_STORE(P, PR, MAX_PR, VEC, I)                \
> I think this can be improved along two possible lines.
> 1. Use an odd-factor variant of the mulN functions (introduced in my
>    binomial code, see https://gmplib.org/devel/).

> 2. Table pre-accumulated odd-factor products up to some limit n.  I

In the current code, the macro is used in three different contexts:

- LOOP_ON_SIEVE_BEGIN (prime,...); FACTOR_LIST_STORE (prime,...);
 to compute the product of a given interval of PRIMES;
- do { FACTOR_LIST_STORE (i, ...); i += 2;} while (i <= tn);
 to compute the product of a given interval of ODD-NUMBERS;
- do { FACTOR_LIST_STORE (i, ...);} while (++i <= n);
 for the very basecase computation of n! (disabled on many archs, where
FAC_ODD_THRESHOLD=0): product of a given interval of NATURAL-NUMBERS;

Each one would need a slightly different optimisation strategy... But you
are right, when you suggest to act on the "odd-factor" context, it is the
one where a speed-up can give an improvement on performances.

The full code I developed, containing also some code for the binomial,
double factorial, primorial and so on... reuse the same macro in similar
contexts or in some new ones...

> For a 64-bit machins, using a 1000-byte table, we get 125 limb entries,

Should we add such a big table only for the factorial?
Another possible strategy would be to add another intermediate algorithm
for the oddfactorial:
 - instead of computing product of odd n in [3..100] times odd n in
[3..50] times odd n in [3..25] ...
 - we can compute: odd n in ]50..100] times (odd n in ]25..50])^2, times
(odd n in [13..25])^3 ...
This would probably add only a 1000-byte-of-code function... but I'm

> Saving the non-constant division for computing max_prod (and its name
> variants) is also important.

Yes, an approximation should be enough... In particular the continuous
update in the loop
  for (; ...; tn >>= 1) {
    max_prod = GMP_NUMB_MAX / tn;
is not clever... updating tn >>= 1; max_prod <<= 1; should be enough.

I'll change it.

> (Should my binom code also work with odd factors?  I haven't thought
about that.)

My tests with binom code used a function:
mpz_prodd_1_1 (mpz_ptr x, mp_limb_t l, mp_limb_t h)

The core of it is:

  do {
    max_prod = GMP_NUMB_MAX / h;
    i = l | 1;
    do {
      FACTOR_LIST_STORE (i, prod, max_prod, factors, j);
      i += 2;
    } while (i <= h);
    h >>= 1;
    l = (l + 1) >> 1;
  } while (h > l);

... this (stores the factors to) compute the odd component of the product
of the natural numbers in [l..h] (the last factor h>>ctz(h) is handled
separately). Again it would benefit of a clever optimisation for the
product of consecutive odd numbers...

My mpz_dc_bin_ui uses basically the same loop (with both l and h mpz_t)
for its basecase...

I'm very interested in your opinion about the mpz_prodlimbs(X, F[], J)
function. It is used to compute in X the product of the J limbs contained
in the array F (the array is overwritten, but it can not overlap with X).

The strategy of the current code is: split the list in two parts
recursively, below a given threshold (related to TOOM22) loop on mul_1,
multiply the two parts.
It might be better to use mul_1 at first, to accumulate the factors up-to
a given block size (again TOOM22_THRS) using as many limbs as needed (we
do not know "a priori", because it depends on how "full" the limbs are),
then use the "split in two parts" strategy with blocks.

Should we measure a new threshold for this block size?
MUL_TOOM22_THRESHOLD is based on balanced products, while here we should
compare balanced with mul_1...

Let me know.



More information about the gmp-devel mailing list