Loop unrolling consists of replicating code so that several limbs are
processed in each loop. At a minimum this reduces loop overheads by a
corresponding factor, but it can also allow better register usage, for example
alternately using one register combination and then another. Judicious use of
m4 macros can help avoid lots of duplication in the source code.
Any amount of unrolling can be handled with a loop counter that’s decremented by N each time, stopping when the remaining count is less than the N limbs a further iteration would process. Alternatively, by subtracting N at the start, the termination condition becomes the counter C being less than 0 (and the count of remaining limbs is then C+N).
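The second form of the counter can be sketched in C as follows. This is only an illustration under stated assumptions, not code from any particular library: the limb type limb_t, the function name sketch_and_n, and the choice of N = 4 are all hypothetical, and the excess limbs are simply finished with a plain loop.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long limb_t;   /* hypothetical limb type for this sketch */

#define UNROLL 4                /* N in the text; chosen arbitrarily */

/* Hypothetical helper: dst[i] = a[i] & b[i] for 0 <= i < n. */
void sketch_and_n(limb_t *dst, const limb_t *a, const limb_t *b, long n)
{
    assert(n >= 0);

    long c = n - UNROLL;        /* subtract N at the start */

    while (c >= 0) {            /* terminate once C is less than 0 */
        dst[0] = a[0] & b[0];
        dst[1] = a[1] & b[1];
        dst[2] = a[2] & b[2];
        dst[3] = a[3] & b[3];
        dst += UNROLL;
        a += UNROLL;
        b += UNROLL;
        c -= UNROLL;
    }

    long rem = c + UNROLL;      /* remaining limbs are C+N, in 0..N-1 */
    while (rem-- > 0)
        *dst++ = *a++ & *b++;
}
```

With n = 11 the unrolled body runs twice (C goes 7, 3, -1) and the final loop handles the 3 limbs left over.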
Alternatively, for a power-of-2 unroll, the loop count and remainder can be established with a shift and mask. This is convenient if also making a computed jump into the middle of a large loop.
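As a small sketch of the shift and mask, assuming an 8-limb unroll, the split might look like this in C. The macro names and the function sketch_split are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL_LOG2 3                          /* assumed: unroll by 2^3 = 8 */
#define UNROLL_MASK (((size_t)1 << UNROLL_LOG2) - 1)

/* Hypothetical helper: split a size of n limbs into the number of full
   unrolled iterations and the number of excess limbs. */
void sketch_split(size_t n, size_t *iterations, size_t *excess)
{
    assert(iterations != NULL && excess != NULL);

    *iterations = n >> UNROLL_LOG2;   /* n / 8 by shift */
    *excess = n & UNROLL_MASK;        /* n % 8 by mask  */
}
```

For n = 27 this gives 3 full iterations and 3 excess limbs. No division is needed, which is the attraction of a power-of-2 unroll.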
The limbs left over when the size is not a multiple of the unrolling can be
handled in various ways, for example with a switch statement providing
separate code for each possible excess. An 8-limb unrolling, say, would have
separate code for 0 remaining, 1 remaining, etc, up to 7 remaining. This might
take a lot of code, but may be the best way to optimize all cases in
combination with a deep pipelined loop.
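The switch approach might be sketched in C as below. To keep the example short it assumes a 4-limb unrolling rather than 8, so there are only four cases; limb_t and sketch_com_n are hypothetical names. In real assembly each case would typically be scheduled separately, which a C sketch cannot show.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long limb_t;   /* hypothetical limb type for this sketch */

/* Hypothetical helper: dst[i] = ~src[i] for 0 <= i < n, unrolled 4 ways. */
void sketch_com_n(limb_t *dst, const limb_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t iterations = n >> 2;   /* full 4-limb iterations */
    size_t excess = n & 3;        /* 0 to 3 limbs left over */

    while (iterations-- > 0) {
        dst[0] = ~src[0];
        dst[1] = ~src[1];
        dst[2] = ~src[2];
        dst[3] = ~src[3];
        dst += 4;
        src += 4;
    }

    /* Separate code for each possible excess, with no shared tail. */
    switch (excess) {
    case 0:
        break;
    case 1:
        dst[0] = ~src[0];
        break;
    case 2:
        dst[0] = ~src[0];
        dst[1] = ~src[1];
        break;
    case 3:
        dst[0] = ~src[0];
        dst[1] = ~src[1];
        dst[2] = ~src[2];
        break;
    }
}
```

The code for the cases grows roughly with the square of the unrolling factor, which is why this can take a lot of space for an 8-limb unroll.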