15.8.3 Carry Propagation

The problem that presents most challenges in GMP is propagating carries from one limb to the next. In functions like mpn_addmul_1 and mpn_add_n, carries are the only dependencies between limb operations.

On processors with carry flags, a straightforward CISC style adc is generally best. AMD K6 mpn_addmul_1 however is an example of an unusual set of circumstances where a branch works out better.

On RISC processors generally an add and compare for overflow is used. This sort of thing can be seen in mpn/generic/aors_n.c. Some carry propagation schemes require 4 instructions, meaning at least 4 cycles per limb, but other schemes may use just 1 or 2. On wide superscalar processors performance may be completely determined by the number of dependent instructions between carry-in and carry-out for each limb.

On vector processors good use can be made of the fact that a carry bit only very rarely propagates more than one limb. When adding a single bit to a limb, there’s only a carry out if that limb was 0xFF…FF which on random data will be only 1 in 2^mp_bits_per_limb. mpn/cray/add_n.c is an example of this, it adds all limbs in parallel, adds one set of carry bits in parallel and then only rarely needs to fall through to a loop propagating further carries.

On the x86s, GCC (as of version 2.95.2) doesn’t generate particularly good code for the RISC style idioms that are necessary to handle carry bits in C. Often conditional jumps are generated where adc or sbb forms would be better. And so unfortunately almost any loop involving carry bits needs to be coded in assembly for best results.