The problem that presents most challenges in GMP is propagating carries from
one limb to the next. In functions like
mpn_add_n, carries are the only dependencies between limb operations.
On processors with carry flags, a straightforward CISC style
generally best. AMD K6
mpn_addmul_1 however is an example of an
unusual set of circumstances where a branch works out better.
On RISC processors generally an add and compare for overflow is used. This sort of thing can be seen in mpn/generic/aors_n.c. Some carry propagation schemes require 4 instructions, meaning at least 4 cycles per limb, but other schemes may use just 1 or 2. On wide superscalar processors performance may be completely determined by the number of dependent instructions between carry-in and carry-out for each limb.
On vector processors good use can be made of the fact that a carry bit only
very rarely propagates more than one limb. When adding a single bit to a
limb, there’s only a carry out if that limb was
0xFF…FF which on
random data will be only 1 in 2^mp_bits_per_limb. mpn/cray/add_n.c is an example of this, it adds
all limbs in parallel, adds one set of carry bits in parallel and then only
rarely needs to fall through to a loop propagating further carries.
On the x86s, GCC (as of version 2.95.2) doesn’t generate particularly good code
for the RISC style idioms that are necessary to handle carry bits in
C. Often conditional jumps are generated where
would be better. And so unfortunately almost any loop involving carry bits
needs to be coded in assembly for best results.