Performance of addlsh_n and sublsh_n

Niels Möller nisse at lysator.liu.se
Wed Feb 2 21:10:49 CET 2011


Torbjorn Granlund <tg at gmplib.org> writes:

> On AMD K8, K9, K10, and Intel Sandy Bridge, addlsh_n and sublsh_n are
> slower than addmul_1 and submul_1.

Is the mul instruction is faster than the logic needed to shift a single
limb operand left and get a two-limb result? Let me think aloud for a bit...

The shift would be something like

   C Set (lo, hi) <- 2^{%cl} lo
   mov lo, hi
   shl %cl, lo
   neg %cl
   shr %cl, hi
   neg %cl

A corresponding addmul_1-style loop, without unrolling. The recurrency
variable is hi:

.Loop:
   mov	(vp, i), vlo
   mov	vlo, vhi
   shl  %cl, vlo
   neg  %cl
   shr  %cl, vhi
   neg  %cl
   add	hi, vlo
   adc	(up, i), vhi
   mov	vhi, hi		C Should be eliminated by unrolling
   mov	vlo, (rp, i)
   add	$8, i
   jnc	.Loop

At least recurrency chain via hi is short (just like addmul_1). 7
instructions per iteration if I don't count loop-control and negation of
the shift count, which can amortized by unrolling. With k-times
unrolling we'd get 4 + 7k instructions. On AMD processors, 2.75
cycles/limb with k = 4 and 2.5 c/l with (unrealistic) k = 8.

Since cycles per limb seems to be decoder-limited, I appears to be hard
to beat addmul_1, which should need fewer instructions and have an
equivalent recurrency chain.

I see that mpn/x86_64/aorlsh_n.asm is not quite organized like the code
above. But I imagine one would get something a bit similar if one
unrolls a few times and eliminates some of the neg %cl. Then maybe the
only difference is that the above approach adds the left and
right-shifted words independently into the collected result (one add and
one adc per limb), while the current unrolled code first ors together
left and right-shifted words from adjacent iterations (one or and one
adc per limb).

One advantage with the above organization is that the addition into the
new hi can't overflow, hence we don't need to save the carry flag
between iterations.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


More information about the gmp-devel mailing list