AMD-64 optimizations, some (new) code

Mon Sep 26 12:09:03 CEST 2005

Ashod Nakashian <saghmos at xter.net> writes:

  Finally I've successfully ported popham.asm and mul_1.asm. mul_1
  uses software prefetching and my tests show that the current code
  is the fastest (~3 c/l, and as low as 2.3 c/l for about 700-750
  limbs).

I cannot reproduce 2.3 c/l for mpn_mul_1.
The best value I get is 3.3 c/l, at n=900 limbs.
The loop needs 68 decode cycles, or 2.83 decode cycles per limb,
so the decoder alone rules out anything below 2.83 c/l.
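
For readers who want the arithmetic spelled out, here is a minimal C
sketch. It is not the assembly under discussion. ref_mul_1 only mirrors
the documented behaviour of mpn_mul_1 (multiply n limbs by one limb,
store the n low limbs, return the high limb) for 64-bit limbs, and it
relies on the GCC/Clang unsigned __int128 extension. The names limb_t,
ref_mul_1 and decode_bound_cl are made up for illustration. The figure
of 24 limbs per loop iteration is an assumption inferred from
68 / 2.83; the message itself does not state the unrolling factor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name; one 64-bit limb */

/* Reference semantics only: rp[0..n-1] = low limbs of up[0..n-1] * v,
   return value = the most significant limb of the product. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128-bit product plus carry-in; this cannot overflow
           128 bits, since (2^64-1)^2 + (2^64-1) < 2^128. */
        unsigned __int128 p = (unsigned __int128)up[i] * v + carry;
        rp[i] = (limb_t)p;           /* low 64 bits */
        carry = (limb_t)(p >> 64);   /* high 64 bits feed the next limb */
    }
    return carry;
}

/* Lower bound on cycles per limb when every instruction in the loop
   has to pass through the decoder on each iteration. */
static double decode_bound_cl(unsigned decode_cycles, unsigned limbs_per_iter)
{
    assert(limbs_per_iter != 0);
    return (double)decode_cycles / limbs_per_iter;
}
```

Under that assumption, decode_bound_cl(68, 24) gives 68/24, about
2.83, which is above the reported 2.3 c/l and below the 3.3 c/l
measured here.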

--
Torbjörn