fac_ui rewrote.

Torbjorn Granlund tg at gmplib.org
Sun Jan 15 15:21:41 CET 2012

I have now read the new factorial code looking for possible further
improvements.

For the limb accumulation, the code uses this macro:

#define FACTOR_LIST_STORE(P, PR, MAX_PR, VEC, I)                \
do {                                                          \
if ((PR) > (MAX_PR)) {                                      \
(VEC)[(I)++] = (PR);                                      \
(PR) = (P);                                               \
} else                                                      \
(PR) *= (P);                                              \
} while (0)

I think this can be improved along two possible lines.

1. Use an odd-factor variant of the mulN functions (introduced in my
binomial code, see https://gmplib.org/devel/).

2. Table pre-accumulated odd-factor products up to some limit n.  I
think this does not merit a huge table, but perhaps up to a few
hundred limbs could be put into the lib.

The reason (1) is a possible improvement is twofold: it avoids branch
misprediction costs, and it breaks the chain of dependent multiplies.

Using functions mulN makes the code clean, but there is a a slight cost
for calling a function, in particular for machines with deep pipeline
but no prediction for jumps/call through a pointer (this is now unusual
for high-end x86 processors).  At the cost of code size, we could
instead inline the functions, using (say) 8 loops.  (Since you invoke
FACTOR_LIST_STORE inn several places, this might be slightly
unattractive, unless we can put such code in one place.)

If we add a table as per (2), we will need mulN functions (or inlined
loops) for smaller value of N, without accumulating fewer-than-possible
factors, and thereby losing some performance.

For a 64-bit machins, using a 1000-byte table, we get 125 limb entries,
with first omitted odd factor being 1633.  Accumulation from that point
on will require 5 or fewer factors.

Saving the non-constant division for computing max_prod (and its name
variants) is also important.  Choosing mulN functions should be done
with a small loop, see MAXFACS in e.g., smallk_bin_uiui.c.  An integer
divison (by a non-constant divisor) cost 50-100 cycles.  (Using a table
as per (2) we can truncate the N selection table at its head.)

(Should my binom code also work with odd factors?  I haven't thought