AMD-64 optimizations, some (new) code

Ashod Nakashian saghmos at xter.net
Mon Sep 26 21:39:54 CEST 2005


Torbjorn Granlund wrote:
> Ashod Nakashian <saghmos at xter.net> writes:
> 
>   Well, I'd agree with you, if the numbers weren't so
>   consistent. In fact, they are so consistent that two runs of the
>   same range give, in quite a lot of points, the very same results,
>   to the last digit after the point. This includes the fastest
>   point, which, by the way, is NOT 2.3 c/l (sorry), but 2.17 !!!
>   Not a typo. This value is reached at 920 limbs.
>   
> Forgive me, but I am skeptical that these are the right numbers.
> Something is wrong.  I'd like to find out what.
> 
> I know the K8 pipeline quite well, and it is simply not capable
> of executing the mul_1.asm loop at less than 2.83 c/l, and even
> that only if the MULQ instructions are placed so they never
> straddle a decoder border.
> 
> The number I measure is consistent with what it should run at,
> from a theoretical analysis.  I have organized the code below to
> reflect how I think the decoders will work through the code.  The
> decoders will be busy for 25 cycles for the 8 limbs handled per
> iteration (corresponding to 3.125 c/l).
> I measure 26 cycles for one entire loop iteration; the extra cycle
> might be due to scheduling issues.
> 
> L(oop):
> 	prefetcht0 128(%rsi)
> 	movq	(%rsi), %rax
> 	movq	8(%rsi), %r10
> 
> 	mulq	%rcx
> 	movq	$0, %r8
> 
> 	addq	%rax, %r11
> 	adcq	%rdx, %r8
> 	movq	%r10, %rax
> 
> 	movq	16(%rsi), %r10
> 	movq	%r11, (%rdi)
> 	# empty decoder slot
> 
> 	mulq	%rcx
> 	movq	$0, %r11
> 
> 	addq	%rax, %r8	C new lo + cylimb
> 	adcq	%rdx, %r11
> 	movq	%r10, %rax
> 
> 	movq	24(%rsi), %r10
> 	movq	%r8, 8(%rdi)
> 	# empty decoder slot
> 
> 	mulq	%rcx
> 	movq	$0, %r8
> 
> 	addq	%rax, %r11	C new lo + cylimb
> 	adcq	%rdx, %r8
> 	movq	%r10, %rax
> 
> 	movq	32(%rsi), %r10
> 	movq	%r11, 16(%rdi)
> 	# empty decoder slot
> 
> 	mulq	%rcx
> 	movq	$0, %r11
> 
> 	addq	%rax, %r8	C new lo + cylimb
> 	adcq	%rdx, %r11
> 	movq	%r10, %rax
> 
> 	movq	40(%rsi), %r10
> 	movq	%r8, 24(%rdi)
> 	# empty decoder slot
> 
> 	mulq	%rcx
> 	movq	$0, %r8
> 
> 	addq	%rax, %r11	C new lo + cylimb
> 	adcq	%rdx, %r8
> 	movq	%r10, %rax
> 
> 	movq	48(%rsi), %r10
> 	movq	%r11, 32(%rdi)
> 	# empty decoder slot
> 
> 	mulq	%rcx
> 	movq	$0, %r11
> 
> 	addq	%rax, %r8	C new lo + cylimb
> 	adcq	%rdx, %r11
> 	movq	%r10, %rax
> 
> 	movq	56(%rsi), %r10
> 	movq	%r8, 40(%rdi)
> 	# empty decoder slot
> 
> 	mulq	%rcx
> 	movq	$0, %r8
> 
> 	addq	%rax, %r11	C new lo + cylimb
> 	adcq	%rdx, %r8
> 	movq	%r10, %rax
> 
> 	movq	%r11, 48(%rdi)
> 	mulq	%rcx
> 
> 	movq	$0, %r11
> 	addq	%rax, %r8	C new lo + cylimb
> 	adcq	%rdx, %r11
> 
> 	movq	%r8, 56(%rdi)
> 	leaq	64(%rsi), %rsi
> 	leaq	64(%rdi), %rdi
> 
> 	decq	%r9
> 	jnz	L(oop)
> 	# empty decoder slot
> 
> 
>   So my only question is, could this be the difference between
>   processor revisions? I mean we have seen HUGE changes in
>   decode/dispatch speed of certain instructions from revision to
>   revision, the last of which was in the P4 Prescott.
>   
> I measured on two different cores, one really early Slegdehammer,
> and one somewhat newer Clawhammer.  They give the same results.
> 
> I am not aware of any major changes in the K8 cores (but there
> have been reports that the X2 series runs GMP a tad slower, so
> there might be some change made recently).
> 
>   My CPU is a Clawhammer, 0.13 micron process. I had more info, like
>   the revision and stepping, but I can't find it now. I'll try to get
>   some software to dump the info and I'll send it if you are
>   interested.
>   
> This program might be useful:
> 
> 
> 
> ------------------------------------------------------------------------
> 
> 
>   Now, to sort this out, I also attached 3 different runs of
>   'speed' with mpn_mul_1.1: 50-50k in 10 steps, 600-1200 in 10
>   steps (this is the range where the fastest timings are found) and
>   600-1200 in 25 steps (just to put the previous number in
>   perspective and see the overall graph, which is identical in the
>   3 runs). I could send other runs, since the data is really very
>   consistent, but I guess these will do. You can check out the
>   data and see the command-line parameters passed to 'speed'. Hope
>   it helps you.
>   
> OK, I think we've nailed it now.  I guess you're using "speed
> -CD" and assume the smallest numbers represent the speed of the
> loop.  That's not right.  Get rid of the D flags to see actual
> numbers.
> 
> --
> Torbjörn

 > Forgive me, but I am skeptical that these are the right numbers.
 > Something is wrong.  I'd like to find out what.

I understand how you did the math; my original calculated throughput 
limit was 3 c/l. I'm surprised by these results as well.

 > OK, I think we've nailed it now.  I guess you're using "speed
 > -CD" and assume the smallest numbers represent the speed of the
 > loop.  That's not right.  Get rid of the D flags to see actual
 > numbers.

'speed' dumps the original command line at the top of every .gnuplot 
file, so open any of the attached mul_1.gnuplot files and you'll see the 
command line I used. Take the following, for example:

# Generated with:
# ./speed -s 50-50000 -t 10 -C -P mul_1 mpn_mul_1.1

set key left
set data style linespoints
plot  "mul_1.data" using 1:2 title "mpn_mul_1.1"
load "-"

I really don't see any -CD. I do use -C (which means "per limb time"). I 
don't know what -D does, or whether combining it as -CD makes a 
difference. Does this command line agree with your assumptions?

What else could produce such unexpected results, besides core 
differences and problems with 'speed'?

Should I perform the tests in some other manner? Say, write my own code 
that does the multiplication and use 'time' to measure the overall run 
time, then compute the per-limb value minus overhead. I don't know if 
that would help, but it is quite disappointing to have findings that 
cannot be explained rationally.

On the other hand, I think 3.3 c/l is pretty decent performance, 
wouldn't you agree? (Putting aside that the code is arguably even faster 
than that.)

Regards,
Ash

