fac_ui rewrote.

Torbjorn Granlund tg at gmplib.org
Sun Jan 15 15:21:41 CET 2012


I have now read the new factorial code looking for possible further
improvements.

For the limb accumulation, the code uses this macro:

#define FACTOR_LIST_STORE(P, PR, MAX_PR, VEC, I)                \
  do {                                                          \
    if ((PR) > (MAX_PR)) {                                      \
      (VEC)[(I)++] = (PR);                                      \
      (PR) = (P);                                               \
    } else                                                      \
      (PR) *= (P);                                              \
  } while (0)

I think this can be improved along two possible lines.

1. Use an odd-factor variant of the mulN functions (introduced in my
   binomial code, see https://gmplib.org/devel/).

2. Table pre-accumulated odd-factor products up to some limit n.  I
   think this does not merit a huge table, but perhaps up to a few
   hundred limbs could be put into the lib.

The reason (1) is a possible improvement is twofold: it avoids branch
misprediction costs, and it breaks the chain of dependent multiplies.

Using functions mulN makes the code clean, but there is a a slight cost
for calling a function, in particular for machines with deep pipeline
but no prediction for jumps/call through a pointer (this is now unusual
for high-end x86 processors).  At the cost of code size, we could
instead inline the functions, using (say) 8 loops.  (Since you invoke
FACTOR_LIST_STORE inn several places, this might be slightly
unattractive, unless we can put such code in one place.)

If we add a table as per (2), we will need mulN functions (or inlined
loops) for smaller value of N, without accumulating fewer-than-possible
factors, and thereby losing some performance.

For a 64-bit machins, using a 1000-byte table, we get 125 limb entries,
with first omitted odd factor being 1633.  Accumulation from that point
on will require 5 or fewer factors.

Saving the non-constant division for computing max_prod (and its name
variants) is also important.  Choosing mulN functions should be done
with a small loop, see MAXFACS in e.g., smallk_bin_uiui.c.  An integer
divison (by a non-constant divisor) cost 50-100 cycles.  (Using a table
as per (2) we can truncate the N selection table at its head.)

(Should my binom code also work with odd factors?  I haven't thought
about that.)

-- 
Torbjörn


More information about the gmp-devel mailing list