fac_ui rewrote.

Tue Jan 17 16:50:01 CET 2012

bodrato at mail.dm.unipi.it writes:

  In the current code, the macro is used in three different contexts:

  - LOOP_ON_SIEVE_BEGIN (prime,...); FACTOR_LIST_STORE (prime,...);
   to compute the product of a given interval of PRIMES;
  - do { FACTOR_LIST_STORE (i, ...); i += 2;} while (i <= tn);
   to compute the product of a given interval of ODD-NUMBERS;
  - do { FACTOR_LIST_STORE (i, ...);} while (++i <= n);
   for the very basecase computation of n! (disabled on many archs, where
  FAC_ODD_THRESHOLD=0): product of a given interval of NATURAL-NUMBERS;

  Each one would need a slightly different optimisation strategy... But you
  are right, when you suggest to act on the "odd-factor" context, it is the
  one where a speed-up can give an improvement on performances.

I misread the code, I thought they were used more similarly.

I guess it is only for the code for smaller computations my streamlining
suggestions might make any difference.  Avoiding unpredicatble branches
in the code multipliyng primes, will be trickier, but not impossible.
(Such code should be written assuming the prime gap is less than some
limit (say GMP_LIMB_BITS) and the next sieved prime could then be
extracted with some word operations and a count_XXXXX_zeros.  We could
multiply several sieved primes in the style of mulN by looking for found
primes in adjacent words, not enforcing them to be consecutive.  But
this is hardly worth the effort.)

  > For a 64-bit machins, using a 1000-byte table, we get 125 limb entries,

  Should we add such a big table only for the factorial?

Considering that the fac_ui binary is about 4000 bytes, a 1k table is
not too bad.

I was suggesting the use of the table also for saving some mulN code.

  Another possible strategy would be to add another intermediate algorithm
  for the oddfactorial:
   - instead of computing product of odd n in [3..100] times odd n in
  [3..50] times odd n in [3..25] ...
   - we can compute: odd n in ]50..100] times (odd n in ]25..50])^2, times
  (odd n in [13..25])^3 ...
  This would probably add only a 1000-byte-of-code function... but I'm
  hesitating.

This is a clever idea.  I am not sure how to organise the actual
computation, assuming this is (semi-) basecase code.

An alternative to my table of limbs, one might table m! as multilimb
value for selected m, together with (\prod_d^m) and (\prod_d^m)^(-1) mod
B^k for same m and selected d and suitably large k.  When asked to
compute n!, the closest tabled m! would be selected, and then adjusted
with the other tabled values...

  > Saving the non-constant division for computing max_prod (and its name
  > variants) is also important.

  Yes, an approximation should be enough... In particular the continuous
  update in the loop
    for (; ...; tn >>= 1) {
      max_prod = GMP_NUMB_MAX / tn;
      ...
  is not clever... updating tn >>= 1; max_prod <<= 1; should be enough.

  I'll change it.

I think I decided that loop was used just for large n! computations, and
that it furthermore looped propotionally to n, and therefore was not too
worried.

The division in the basecase code is probably worse, in terms of
percentual cost.

  > (Should my binom code also work with odd factors?  I haven't thought
  about that.)

  My tests with binom code used a function:
  mpz_prodd_1_1 (mpz_ptr x, mp_limb_t l, mp_limb_t h)

  The core of it is:

    do {
      max_prod = GMP_NUMB_MAX / h;
      i = l | 1;
      do {
        FACTOR_LIST_STORE (i, prod, max_prod, factors, j);
        i += 2;
      } while (i <= h);
      h >>= 1;
      l = (l + 1) >> 1;
    } while (h > l);

  ... this (stores the factors to) compute the odd component of the product
  of the natural numbers in [l..h] (the last factor h>>ctz(h) is handled
  separately). Again it would benefit of a clever optimisation for the
  product of consecutive odd numbers...

I suppose here the mulN trick is applicable.

  I'm very interested in your opinion about the mpz_prodlimbs(X, F[], J)
  function. It is used to compute in X the product of the J limbs contained
  in the array F (the array is overwritten, but it can not overlap with X).

I will take a look in some time.

  The strategy of the current code is: split the list in two parts
  recursively, below a given threshold (related to TOOM22) loop on mul_1,
  multiply the two parts.

OK, this makes sense.  It should result nearly perfect balance.

  It might be better to use mul_1 at first, to accumulate the factors up-to
  a given block size (again TOOM22_THRS) using as many limbs as needed (we
  do not know "a priori", because it depends on how "full" the limbs are),
  then use the "split in two parts" strategy with blocks.

(Perhaps some fudge factor could be used to compensate for the
non-fullness of the limbs?)

I am not sure I understand this alternative method.  Are you suggesting
that one starts multiplyng limbs, forming some number of TOOM22_THRS
sized chunks, and then start multiplying these chunks recursively?

Let's see.  For huge computations, the balance of the few largest
multiplies will determine speed.  The "splitting factor" is 2, meaning
that cumulative basecase data size will be linear in the final data size
(unlike e.g. TOOM) and therefore basecase time will be negligible.

I have poorer intuition for the situation for smaller computations.

  Should we measure a new threshold for this block size?
  MUL_TOOM22_THRESHOLD is based on balanced products, while here we should
  compare balanced with mul_1...

Good question.  Using a fudge factor computed from the limb size and the
limb count for when this entire algorithm is used, might help a little.
(E.g., if this code is used for n! computation for n > 10000, and word
size is 64 bits, the average non-fullness should be log2(10000)/2 = 6.64
for a fudge factor of about 64/(64-6.64) = 1.12.)

-- 
Torbjörn