Marco Bodrato writes:

> Well, I added one more move to order the cases as you suggest. The
> code gets a little bit shorter.

Thanks, looks good to me. I think one more instruction is easy to move,
see below.

> I also renamed registers, so that a push/pop couple is needed only if
> the loop is used; this may save a couple of cycles when the size is
> small. Does this make sense?

Makes sense.

> L(end):	mul	%r9
> 	add	%rax, %r11
> 	adc	%rdx, %r10
> 	cmp	$1, R32(n)
> 	ja	L(two)
> 	jnz	L(nul)
> 	mov	-8(ap), %rax      <-- 1

I think this instruction and the one marked "2" below can be moved to
the start of the L(one): part, just before the mul %r8 ("3" below).
Slightly worse scheduling, though.

> 	mov	%r11, -16(rp)
> 	mov	%r10, %r11
> 	jmp	L(one)

I had hoped this jump and preceding instructions could be eliminated, to
get a structure like

	ja	L(two)
	jz	L(one)

L(nul):	(no jumps to this label left)
	fall through
	fall through
	function exit

But that might need additional move instructions, to get the right data
into the right registers?

> L(nul):	mov	-16(ap), %rax
> 	mov	%r11, -24(rp)
> 	mul	%r8
> 	add	%rax, %r10
> 	mov	-16(bp), %rax
> 	mov	$0, R32(%r11)
> 	adc	%rdx, %r11
> 	mul	%r9
> 	add	%rax, %r10
> 	mov	-8(ap), %rax      <-- 2
> 	adc	%rdx, %r11
> 	mov	%r10, -16(rp)
> L(one):	mul	%r8       <-- 3
> 	add	%rax, %r11
> 	mov	-8(bp), %rax
> 	mov	$0, R32(%r10)
> 	adc	%rdx, %r10
> 	mul	%r9
> 	add	%rax, %r11
> 	adc	%rdx, %r10
> L(two):	mov	%r11, -8(rp)
> 	mov	%r10, %rax
> L(ret):	pop	%rbp
> 	ret
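
Untested sketch of the first suggestion above, with the loads marked 1
and 2 merged into a single load at the start of L(one). Everything else
is copied unchanged from your code; the "C" comments (the m4 comment
marker used in GMP .asm files) are mine and only mark what moved.

```
	cmp	$1, R32(n)
	ja	L(two)
	jnz	L(nul)
	mov	%r11, -16(rp)		C load 1 no longer precedes this store
	mov	%r10, %r11
	jmp	L(one)
L(nul):	mov	-16(ap), %rax
	mov	%r11, -24(rp)
	mul	%r8
	add	%rax, %r10
	mov	-16(bp), %rax
	mov	$0, R32(%r11)
	adc	%rdx, %r11
	mul	%r9
	add	%rax, %r10
	adc	%rdx, %r11		C load 2 no longer sits between add and adc
	mov	%r10, -16(rp)
L(one):	mov	-8(ap), %rax		C single load replacing 1 and 2
	mul	%r8			C 3
	add	%rax, %r11
	C rest of L(one), L(two) and L(ret) unchanged
```

The mov doesn't touch the flags, so the add/adc pairs are unaffected. The
cost is that the load now sits right before the mul that consumes %rax,
which is the slightly worse scheduling mentioned above.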

So I think your version is an improvement as is, and it's perhaps not
worth the effort to try to eliminate a few more instructions in this
rather obscure function.


