When choosing an approach for an assembly loop, consideration is given to what operations can execute simultaneously and what throughput can thereby be achieved. In some cases an algorithm can be tweaked to accommodate available resources.
Loop control will generally require a counter and pointer updates, costing as much as 5 instructions, plus any delays a branch introduces. CPU addressing modes might reduce pointer updates, perhaps by allowing just one updating pointer with the others expressed as offsets from it, or on CISC chips by doing all addressing with the loop counter as a scaled index.
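The idea of addressing everything from the loop counter can be sketched in C. This is only an illustration under assumed names: limbs_com is a hypothetical helper, not a GMP function, and limbs are assumed to be 64 bits. Whether a compiler turns it into scaled-index addressing depends on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of GMP): store the one's complement of
   each of the n limbs at src into dst. */
static void limbs_com(uint64_t *dst, const uint64_t *src, ptrdiff_t n)
{
    assert(n >= 0);

    /* Point both operands just past their last limb ... */
    dst += n;
    src += n;

    /* ... so a single negative index, counting up to zero, addresses
       both arrays and also serves as the loop termination test. */
    for (ptrdiff_t i = -n; i != 0; i++)
        dst[i] = ~src[i];
}
```

The only value updated per iteration is i, where a loop with two stepping pointers and a separate counter would update three.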
The final loop control cost can be amortised by processing several limbs in each iteration (see Assembly Loop Unrolling). This at least ensures loop control isn’t a big fraction of the work done.
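As a rough C illustration of that amortisation, the sketch below adds two n-limb numbers four limbs per iteration. The names add_limb and add_n_unrolled4 are hypothetical, limbs are assumed to be 64 bits, and real assembly would use the CPU carry flag rather than comparisons.

```c
#include <stddef.h>
#include <stdint.h>

/* Add one limb pair plus an incoming carry; *carry is 0 or 1 on entry
   and on exit. */
static inline uint64_t add_limb(uint64_t a, uint64_t b, unsigned *carry)
{
    uint64_t s = a + b;
    unsigned c = s < a;        /* carry out of a + b */
    uint64_t r = s + *carry;
    c += r < s;                /* carry out of adding the old carry */
    *carry = c;                /* the two carries cannot both occur */
    return r;
}

/* Hypothetical 4-way unrolled add: rp[] = ap[] + bp[], returns carry out. */
static unsigned add_n_unrolled4(uint64_t *rp, const uint64_t *ap,
                                const uint64_t *bp, size_t n)
{
    unsigned carry = 0;
    size_t i = 0;

    /* Main loop: one counter update and one branch per four limbs. */
    for (; i + 4 <= n; i += 4) {
        rp[i]     = add_limb(ap[i],     bp[i],     &carry);
        rp[i + 1] = add_limb(ap[i + 1], bp[i + 1], &carry);
        rp[i + 2] = add_limb(ap[i + 2], bp[i + 2], &carry);
        rp[i + 3] = add_limb(ap[i + 3], bp[i + 3], &carry);
    }

    /* Up to three leftover limbs, one at a time. */
    for (; i < n; i++)
        rp[i] = add_limb(ap[i], bp[i], &carry);

    return carry;
}
```

With four limbs per pass, a loop control cost of, say, 4 instructions comes to about one instruction per limb.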
Memory throughput is always a limit. If perhaps only one load or one store can be done per cycle then 3 cycles/limb will be the top speed for a “binary” operation like mpn_add_n (two loads and one store per limb), and any code achieving that is optimal.
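The arithmetic behind such a bound is simple enough to write down. The function below is a hypothetical helper for illustration only; it assumes loads and stores share one pool of memory slots per cycle, which is not true of every CPU.

```c
/* Lower bound on cycles per limb when memory operations are the
   bottleneck.  Assumes loads and stores compete for the same
   mem_ops_per_cycle slots (an assumption, not a property of all chips). */
static double mem_bound_cycles_per_limb(unsigned loads, unsigned stores,
                                        unsigned mem_ops_per_cycle)
{
    return (double)(loads + stores) / mem_ops_per_cycle;
}
```

For mpn_add_n each limb needs two loads and one store, so mem_bound_cycles_per_limb(2, 1, 1) gives 3.0, and a chip able to do two memory operations per cycle would give 1.5.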
Integer resources can be freed up by having the loop counter in a float register, or by pressing the float units into use for some multiplying, perhaps doing every second limb on the float side (see Assembly Floating Point).
Float resources can be freed up by doing carry propagation on the integer side, or even by doing integer to float conversions in integers using bit twiddling.
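One well-known bit twiddling conversion, shown here as a sketch and not necessarily the method any particular GMP routine uses, places an integer below 2^52 directly into the mantissa of a double. It assumes IEEE-754 binary64 doubles; u52_to_double is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Convert x, which must be less than 2^52, to a double using integer
   operations for the conversion step.  Assumes IEEE-754 binary64. */
static double u52_to_double(uint64_t x)
{
    /* Bit pattern of 2^52: biased exponent 0x433, mantissa all zeros. */
    const uint64_t two52_bits = UINT64_C(0x4330000000000000);

    /* At exponent 52 each mantissa bit is worth exactly 1, so OR-ing x
       into the mantissa produces the double 2^52 + x. */
    uint64_t bits = two52_bits | x;

    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the bits as a double */

    /* Subtracting 2^52 is exact and leaves x. */
    return d - 4503599627370496.0;
}
```

The one remaining float operation is a subtract; the integer side has done the work a convert instruction would otherwise do.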