AMD-64 optimizations, some (new) code

Mon Sep 26 07:26:17 CEST 2005

Hi,

I've spent some time hacking on some code for AMD-64. The results are 
mixed, but I think I have enough to share at this point.

I've tried to port several modules; here are my findings.

First, in the lshift and rshift modules, shrd and shld run quite slowly 
(they are vector-path instructions), and GCC does a better job with the 
C code. In most tests I found no speed improvement from my port of the 
k7 assembly, and GMP-bench gives slightly lower results with the 
assembly code. I'm not surprised by this, but I expect that a well 
software-pipelined version of the C code, unrolled a few times, would 
give better throughput than the current ~3.x c/l. lshift.asm and 
rshift.asm are in the package, but an underscore is appended to their 
filenames so that they are not compiled by default. Rename them to 
experiment.
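
For reference, the kind of plain C loop I mean looks roughly like the 
sketch below. It is not the code from the package or from GMP itself: 
the name lshift_sketch and the 64-bit limb type are assumptions made for 
illustration. It shifts from the top limb down and uses only plain 
shifts and an OR, with no shld.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an lshift-style routine (not GMP's code).
   Shifts the n-limb number at src left by cnt bits (1 <= cnt <= 63),
   stores the low n limbs at dst, and returns the bits shifted out of
   the most significant limb. */
static uint64_t lshift_sketch(uint64_t *dst, const uint64_t *src,
                              size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);

    unsigned tnc = 64 - cnt;          /* complementary shift count */
    uint64_t high = src[n - 1];
    uint64_t retval = high >> tnc;    /* bits that fall off the top */

    /* Walk from the most significant limb downwards. */
    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = src[i - 1];
        dst[i] = (high << cnt) | (low >> tnc);
        high = low;
    }
    dst[0] = high << cnt;
    return retval;
}
```

Unrolling this loop a few times and overlapping the loads of one 
iteration with the shifts of the previous one is the software 
pipelining I have in mind.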

One difficulty I ran into is that the new ABI computes global offsets 
differently, so the old code for loading values from the global 'mod 
inverse' table doesn't work and produces an error at link time. I tried 
many of the methods suggested in the AMD-64 ABI reference, but none 
worked. In the end I copied the table into the assembly module(s) in 
question and changed the code to load from that local copy, assuming 
that without a global table offset I would avoid the problem. Although 
the code compiled and ran, the results are incorrect ('make check' 
fails). The affected modules are dive_1.asm and mode1o.asm. Either 
there is a bug elsewhere in the code, or the table offset calculation 
is still not working. I've tried to track the bugs down, but I can't 
spot them. In the hope that someone can point out where the problem is, 
I've attached these two modules, with '_x' appended to their filenames 
to mark them as broken.
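
To make clear what the table lookup is supposed to produce, here is a 
rough C sketch of a limb inverse computed from a small local table. The 
names inv_table and limb_inverse_sketch are made up for illustration, 
and the table is filled by brute force here rather than copied from 
GMP, so treat this as a model of the idea and not as the code in 
dive_1.asm or mode1o.asm. The table gives an 8-bit inverse of an odd 
limb, and each Newton step doubles the number of correct bits.

```c
#include <assert.h>
#include <stdint.h>

/* Local table (an assumption for this sketch): inv_table[i] holds the
   inverse of (2*i + 1) modulo 2^8. It is filled once by brute force;
   a real module would carry it precomputed. */
static uint8_t inv_table[128];
static int inv_table_ready = 0;

static void fill_inv_table(void)
{
    for (unsigned i = 0; i < 128; i++) {
        unsigned d = 2 * i + 1;
        for (unsigned x = 1; x < 256; x += 2) {
            if (((d * x) & 0xFF) == 1) {
                inv_table[i] = (uint8_t)x;
                break;
            }
        }
    }
    inv_table_ready = 1;
}

/* Returns inv with d * inv == 1 (mod 2^64). d must be odd. */
static uint64_t limb_inverse_sketch(uint64_t d)
{
    assert(d & 1);
    if (!inv_table_ready)
        fill_inv_table();

    /* Index by bits 1..7 of d; bit 0 is known to be 1. */
    uint64_t inv = inv_table[(d >> 1) & 0x7F];  /* 8 bits correct  */
    inv = 2 * inv - inv * inv * d;              /* 16 bits correct */
    inv = 2 * inv - inv * inv * d;              /* 32 bits correct */
    inv = 2 * inv - inv * inv * d;              /* 64 bits correct */
    return inv;
}
```

Comparing the value loaded by the assembly against a routine like this 
for a few odd limbs might show whether the table load or the later 
arithmetic is at fault.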

Finally, I've successfully ported popham.asm and mul_1.asm. mul_1 uses 
software prefetching, and the current code is the fastest I've measured 
(~3 c/l, and as low as 2.3 c/l for about 700-750 limbs). Both modules 
pass my tests. But again, all this is very experimental. USE AT YOUR 
OWN RISK.
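
As an illustration of what mul_1 computes and where the prefetch sits, 
here is a minimal C sketch. It is not a translation of mul_1.asm: the 
name mul_1_sketch and the prefetch distance of 8 limbs are arbitrary 
assumptions, and it relies on the unsigned __int128 type and 
__builtin_prefetch, which need a reasonably recent GCC (or Clang) on a 
64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in limbs; a guess, not a tuned value. */
#define PREFETCH_AHEAD 8

/* Hypothetical mul_1-style routine: multiplies the n-limb number at
   src by a single limb, stores the low n limbs of the product at dst,
   and returns the most significant (carry-out) limb. */
static uint64_t mul_1_sketch(uint64_t *dst, const uint64_t *src,
                             size_t n, uint64_t multiplier)
{
    assert(n >= 1);

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that a limb further ahead will be read soon. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);

        /* 64x64 -> 128-bit product plus incoming carry; cannot overflow. */
        unsigned __int128 p = (unsigned __int128)src[i] * multiplier + carry;
        dst[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    return carry;
}
```

Whether a prefetch helps at all, and at what distance, depends on the 
operand size and the machine, so the distance above should be tuned by 
measurement.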

I've updated the README and changed the default compiler options.

I'd appreciate constructive comments and suggestions. Any test results 
(positive or negative) are very welcome, as are gmp-bench results, but 
please include your system configuration. I'm currently working on 
divrem_1.asm.

Regards,
Ashod Nakashian

P.S. All development and tests are done on a Fedora Core 2 box, using 
GCC 3.3.3. I'm sending this mail from a Windows box.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-64.zip
Type: application/zip
Size: 34982 bytes
Desc: not available
Url : http://gmplib.org/list-archives/gmp-discuss/attachments/20050926/39f6a30f/x86-64-0001.zip