Performance of addlsh_n and sublsh_n
tg at gmplib.org
Thu Feb 3 08:33:14 CET 2011
nisse at lysator.liu.se (Niels Möller) writes:
With k-times unrolling we'd get 4 + 7k instructions. On AMD
processors, 2.75 cycles/limb with k = 4 and 2.5 c/l with (unrealistic)
k = 8.
I agree; only with impossible unrolling does the used type of code beat
the current addmul_1 code.
Since cycles per limb seems to be decoder-limited, I appears to be
hard to beat addmul_1, which should need fewer instructions and have
an equivalent recurrency chain.
We should consider a few alternatives:
(1) Using shrd/shld double-shifts. They are OK on Intel chips, but I
think they might need more decoder resources on AMD chips than even a
(2) Using psrl/psll, or some other SSE2 or SSEwhatnot instructions.
Saves negations, since the shift count can be in any register.
I see that mpn/x86_64/aorlsh_n.asm is not quite organized like the
code above. But I imagine one would get something a bit similar if one
unrolls a few times and eliminates some of the neg %cl.
One advantage with the above organization is that the addition into
the new hi can't overflow, hence we don't need to save the carry flag
I think you might be mistaken about that you will not get carry between
iterations. The value added from (up) is arbitrary and independent from
the shifted values. Adding anything to B-1 that will carry. If you can
do addlsh_n without a ripple carry, then surely you can supress ripple
carry from plain addition.
I wrote some new code targeting Intel processors, code which uses shrd.
Please see the new numbers within  at http://gmplib.org/devel/asm.html.
(Performance is OK, the code beats addmul_1 well, except on Nehalem where
it beats addmul_1 only
For AMD, Niels shows that we have little chances of beating an addmul_1
loop. I have not worked on that; starting with the existing 2.5 c/l
code, adding a new pointer using the already setup indexing, should not
More information about the gmp-devel