AMD-64 optimizations, some (new) code

Mon Sep 26 22:56:23 CEST 2005

Ashod Nakashian <saghmos at xter.net> writes:

  'speed' dumps the original command line at the top of every .gnuplot
  file. So, for example open any of the attached mul_1.gnuplot files, and
  you'll read the command line I used. Take the following for example:

  # Generated with:
  # ./speed -s 50-50000 -t 10 -C -P mul_1 mpn_mul_1.1

I looked at the raw data.

  I really don't see any -CD. I do use -C (which means "per limb time"). I
  don't know what -D does or -CD (if combining makes a difference.) Does
  this command line agree with your assumptions?

No.  But I am positive there is something fishy with the
measurements, after the analysis, after my own measurements, and
after having seen your fluctuating measurements.  Sorry, it would
have been more fun if the code actually ran at close to 2 c/l.
:-)

  Should I perform the tests in some other manner (say, write my own code
  to do the math and use 'time' to see the overall time then do the math
  to get the per limb value minus overhead). I don't know if that would
  help, but it is quite disappointing to have such findings that cannot be
  explained rationally.

I am sure it is possible to explain this rationally.  You just need
to systematically debug what is wrong with speed.  There might be
a bug in speed, or a bug in the compiler with which you compiled
speed.

  On the other hand, I think 3.3 c/l is pretty decent performance,
  wouldn't you agree? (putting aside that the code is arguably even faster
  than that.)

3.3 is definitely good performance.  The GMP development code
(scheduled for GMP 5) has similar speed (3.0).  I haven't been
able to get under 3.0, although I have tried hard.

It turns out that it is possible to reach 3.0 with quite simple
code:

	TEXT
	ALIGN(16)
	.byte	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ASM_START()
PROLOGUE(mpn_mul_1)
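	movq	%rdx, %r11		C r11 = n (limb count)
	leaq	(%rsi,%rdx,8), %rsi	C point past the end of the source
	leaq	(%rdi,%rdx,8), %rdi	C point past the end of the destination
	negq	%r11			C index runs from -n up to 0
	xorl	%r8d, %r8d		C carry limb = 0

.Loop:	movq	(%rsi,%r11,8), %rax	C load source limb
	mulq	%rcx			C rdx:rax = limb * multiplier
	addq	%r8, %rax		C add carry limb to low half
	movl	$0, %r8d		C clear r8 without touching the flags
	adcq	%rdx, %r8		C new carry = high half + carry flag
	movq	%rax, (%rdi,%r11,8)	C store result limb
	incq	%r11			C inc sets ZF but leaves CF alone
	jne	.Loop

	movq	%r8, %rax		C return the final carry limb
	ret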
EPILOGUE()

--
Torbjörn