GMP on Pentium 2

Sat Nov 8 02:26:57 CET 2003

I found the reason for the 3.7 vs 3.2 cycles/limb performance for
mpn/x86/aors_n.asm on p6.  Alignment.  If the loop start is at an
address 8 mod 16, the loop needs 3.7 cycles/limb, but if it is
aligned 0 mod 16, it needs only 3.2 cycles/limb.  Since the code forces
just 0 mod 8 alignment, either timing can occur, depending on
where the linker ends up placing the code.
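
To make the arithmetic concrete, here is a minimal sketch (in Python, not
part of GMP) of why 0 mod 8 alignment leaves two possible outcomes.  The
helper name possible_residues and the sample addresses are made up for
illustration; the cycle counts are just the two measurements quoted above.

```python
# Measured cycles/limb from the note above, keyed by (loop address mod 16).
CYCLES_PER_LIMB = {0: 3.2, 8: 3.7}


def possible_residues(forced_alignment, modulus=16):
    """Residues mod `modulus` that a loop start address can have when it is
    only guaranteed to be a multiple of `forced_alignment`."""
    # Walk enough multiples of the forced alignment to see every residue.
    span = range(0, modulus * forced_alignment, forced_alignment)
    return sorted({addr % modulus for addr in span})


# With 0 mod 8 alignment the loop can land at 0 or 8 mod 16.
print(possible_residues(8))
# With 0 mod 16 alignment only the fast case remains.
print(possible_residues(16))

# Hypothetical loop start addresses, both valid under 0 mod 8 alignment.
for addr in (0x08049010, 0x08049018):
    print(hex(addr), addr % 16, CYCLES_PER_LIMB[addr % 16])
```

Under these assumptions, forcing 0 mod 16 alignment on the loop start would
leave only the 3.2 cycles/limb case.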

-- 
Torbjörn