AMD-64 optimizations, some (new) code
saghmos at xter.net
Mon Sep 26 07:26:17 CEST 2005
I've spent some time hacking on code for AMD-64. The results are
mixed, but I think I have enough to share at this point.
I've tried to port several modules, and here are my findings.
First, in the lshift and rshift modules, shrd and shld run quite slowly
(they are vector-path instructions), and GCC does a better job with the
C code. In most tests I found no speed improvement from my port of the
k7 assembly, and GMP-bench gives slightly lower results with the
assembly code. I assume this was expected (I'm not surprised), but I
expect that a well software-pipelined version of the C version, unrolled
a few times, would give better throughput than the current ~3.x c/l
(cycles per limb). lshift.asm and rshift.asm are in the package, but the
filenames have an underscore appended to avoid compiling them by
default. Rename
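
For readers who have not looked at the C path, here is a minimal sketch
of a limb-wise left shift of the kind discussed above. It is not GMP's
actual source; the name limbs_lshift and the fixed 64-bit limb type are
assumptions made for illustration. Each output limb is built from two
plain shifts and an OR, which is the work a single shld would otherwise
do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only (hypothetical name, not GMP code).
   Shift the n-limb number {src, n} left by cnt bits, 1 <= cnt <= 63,
   store the low n limbs in {dst, n}, and return the bits shifted out
   of the top limb.  Limbs are stored least significant first. */
uint64_t limbs_lshift(uint64_t *dst, const uint64_t *src, size_t n,
                      unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;            /* complementary shift count */
    uint64_t high = src[n - 1];
    uint64_t out = high >> tnc;         /* bits leaving the top limb */

    /* Walk from the top limb down; each source limb is read before the
       limb above it is written, so dst == src (in place) also works. */
    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = src[i - 1];
        dst[i] = (high << cnt) | (low >> tnc);
        high = low;
    }
    dst[0] = high << cnt;
    return out;
}
```

Unrolling and software pipelining would start from a loop like this
one; whether that beats ~3.x c/l would have to be measured.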
A difficulty I encountered with some of the code is that the new ABI
computes global offsets differently, so the old code for loading values
from the global 'mod inverse' table doesn't work, and you get an error
when linking. I tried many of the methods suggested in the AMD-64 ABI
reference, but none worked. Finally, I copied the table into the
assembly module(s) in question and changed the code to load from that
local copy. I assumed that, since no global table offset is needed this
way, I wouldn't have to face that problem. Although the code compiled
and ran, the results are incorrect ('make check' fails). The affected
modules are dive_1.asm and mode1o.asm. The problem might be a bug in the
code, or the table offset calculation might still not be working. I've
tried to track down the bugs, but I can't spot them. In the hope that
someone can point out where the problem is, I've attached these two
modules, with '_x' appended to the filenames to mark them as broken.
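
As a reference point for debugging, here is a hedged C sketch of what
the table-based code is meant to compute: the inverse of an odd limb
modulo 2^64, seeded from a small module-local table and refined with
Newton steps, followed by an exact division by an odd limb in the style
of dive_1. All names (inv_table, limb_invert, limbs_divexact_odd) are
hypothetical, the table is filled at first use rather than copied from
GMP, and unsigned __int128 is a compiler extension (GCC and Clang on
64-bit targets). Comparing its output with the assembly on a few inputs
may help show whether the table lookup or the main loop is at fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only; names and table layout are assumptions.
   inv_table[i] holds the inverse of (2*i + 1) modulo 2^8. */
static uint8_t inv_table[128];
static int inv_table_ready = 0;

static void inv_table_init(void)
{
    for (unsigned i = 0; i < 128; i++) {
        uint8_t d = (uint8_t)(2 * i + 1);
        uint8_t x = d;                    /* d*d == 1 mod 8: 3 bits  */
        x = (uint8_t)(x * (2 - d * x));   /* 6 correct bits          */
        x = (uint8_t)(x * (2 - d * x));   /* 12 bits, kept mod 2^8   */
        inv_table[i] = x;
    }
    inv_table_ready = 1;
}

/* Return x with d * x == 1 (mod 2^64); d must be odd. */
uint64_t limb_invert(uint64_t d)
{
    assert(d & 1);
    if (!inv_table_ready)
        inv_table_init();
    uint64_t x = inv_table[(d >> 1) & 0x7F];  /* 8 correct bits */
    x = x * (2 - d * x);                      /* 16 bits */
    x = x * (2 - d * x);                      /* 32 bits */
    x = x * (2 - d * x);                      /* 64 bits */
    return x;
}

/* Exact division: {q, n} = {a, n} / d, where d is odd and is known to
   divide the n-limb number exactly.  Limbs are least significant
   first. */
void limbs_divexact_odd(uint64_t *q, const uint64_t *a, size_t n,
                        uint64_t d)
{
    uint64_t inv = limb_invert(d);
    uint64_t c = 0;                       /* amount owed by next limb */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i];
        uint64_t borrow = s < c;          /* did s - c wrap around?   */
        uint64_t qi = (s - c) * inv;      /* quotient limb mod 2^64   */
        q[i] = qi;
        /* The high half of qi * d must be subtracted from the next
           limb, together with any borrow from this one. */
        c = (uint64_t)(((unsigned __int128)qi * d) >> 64) + borrow;
    }
}
```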
Finally, I've successfully ported popham.asm and mul_1.asm. mul_1 uses
software prefetching, and my tests show that the current code is the
fastest (~3 c/l, and as low as 2.3 c/l for about 700-750 limbs). My
tests show both modules are solid. But again, all this is very
experimental. USE AT YOUR OWN RISK.
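
To make the idea concrete, here is a minimal C sketch of an n-limb by
1-limb multiply with a software prefetch hint. It is not a translation
of the attached mul_1.asm: the name limbs_mul_1 and the PREFETCH_AHEAD
distance are made up for illustration, and __builtin_prefetch and
unsigned __int128 are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in limbs; the right value depends on
   the machine and has to be found by measurement. */
#define PREFETCH_AHEAD 8

/* Illustrative sketch only (hypothetical name, not GMP code).
   Multiply {src, n} by the single limb v, store the low n limbs of the
   product in {dst, n}, and return the most significant (carry) limb. */
uint64_t limbs_mul_1(uint64_t *dst, const uint64_t *src, size_t n,
                     uint64_t v)
{
    assert(n >= 1);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that a limb further along will be read soon; the guard
           keeps the address inside the source array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);

        /* 64x64 -> 128-bit product plus incoming carry cannot
           overflow 128 bits. */
        unsigned __int128 p = (unsigned __int128)src[i] * v + carry;
        dst[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    return carry;
}
```

A tuned loop would more likely issue one prefetch per cache line than
one per limb; the per-limb hint here just keeps the sketch short.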
I've updated the README and changed the default compiler options.
I'd appreciate constructive comments and suggestions. Any test results
(+ve and -ve) are also very welcome, as are gmp-bench results, but
please include your system config. I'm currently working on divrem_1.asm.
P.S. All development and testing was done on a Fedora Core 2 box, using
GCC 3.3.3. I'm sending this mail from a Windows box.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 34982 bytes
Desc: not available
Url : http://gmplib.org/list-archives/gmp-discuss/attachments/20050926/39f6a30f/x86-64-0001.zip