Performance of addlsh_n and sublsh_n

Thu Feb 3 08:33:14 CET 2011

nisse at lysator.liu.se (Niels Möller) writes:

  With k-times unrolling we'd get 4 + 7k instructions. On AMD
  processors, 2.75 cycles/limb with k = 4 and 2.5 c/l with (unrealistic)
  k = 8.

I agree; only with impossible unrolling does the used type of code beat
the current addmul_1 code.

  Since cycles per limb seems to be decoder-limited, I appears to be
  hard to beat addmul_1, which should need fewer instructions and have
  an equivalent recurrency chain.

We should consider a few alternatives:

(1) Using shrd/shld double-shifts.  They are OK on Intel chips, but I
think they might need more decoder resources on AMD chips than even a
mul.

(2) Using psrl/psll, or some other SSE2 or SSEwhatnot instructions.
Saves negations, since the shift count can be in any register.

  I see that mpn/x86_64/aorlsh_n.asm is not quite organized like the
  code above. But I imagine one would get something a bit similar if one
  unrolls a few times and eliminates some of the neg %cl.

  One advantage with the above organization is that the addition into
  the new hi can't overflow, hence we don't need to save the carry flag
  between iterations.

I think you might be mistaken about that you will not get carry between
iterations.  The value added from (up) is arbitrary and independent from
the shifted values.  Adding anything to B-1 that will carry.  If you can
do addlsh_n without a ripple carry, then surely you can supress ripple
carry from plain addition.

I wrote some new code targeting Intel processors, code which uses shrd.
Please see the new numbers within [] at http://gmplib.org/devel/asm.html.
(Performance is OK, the code beats addmul_1 well, except on Nehalem where
it beats addmul_1 only

For AMD, Niels shows that we have little chances of beating an addmul_1
loop.  I have not worked on that; starting with the existing 2.5 c/l
code, adding a new pointer using the already setup indexing, should not
be hard.

-- 
Torbjörn