New "fastsse" assembly

Mon Feb 27 16:03:01 CET 2012

I pushed new x86_64 assembly making use of 128-bit instructions working
on xmm registers.  While all x86_64 processors probably support the
instructions used, some get lower throughput from these than from plain
64-bit instructions.

The idea is to include these just before "x86_64" in the mdep search
path, but I have not done that yet; I want to look for small-operand
regressions first.

The files are:

  * copyi.asm and copyd.asm
  * lshift.asm (written in cooperation with David Harvey)

The challenge when using 128-bit ops is alignment; the limbs are just 64
bits while we work with 128-bit ops, and this means operand alignment is
not necessarily better than 64 bits.

The code cannot write the first or last limb with 128-bit operations
unless these are aligned (the last limb is aligned if either the src is
unaligned and the count is odd, or if the src is aligned and the count
is even).  It is however always fine to make an *aligned* read using
128-bit ops, even if this sometimes means we read outside of a defined
operand (although valgrind seems to dislike that practice...).
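
The alignment rule above can be sketched in C.  This is only an
illustration, not the code in the .asm files: limb_t and all function
names are made up here, and it assumes 64-bit limbs and that casting a
pointer to uintptr_t exposes the address bits, as it does on x86_64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a 64-bit limb */

/* Nonzero if p lies on a 16-byte boundary. */
static int aligned16(const limb_t *p)
{
    return ((uintptr_t) p & 15) == 0;
}

/* Nonzero if the operand {p, n} ends on a 16-byte boundary, so that its
   last limb is the upper half of an aligned 128-bit word. */
static int last_limb_aligned(const limb_t *p, size_t n)
{
    return aligned16(p + n);
}

/* The same condition stated as in the text: aligned start with an even
   count, or unaligned start with an odd count. */
static int last_limb_aligned_by_parity(const limb_t *p, size_t n)
{
    return aligned16(p) ? (n % 2 == 0) : (n % 2 == 1);
}

/* Check that both formulations agree for small operands, with the
   operand starting both on and off a 16-byte boundary. */
static void check_alignment_rule(void)
{
    static _Alignas(16) limb_t buf[10];
    for (size_t off = 0; off < 2; off++)
        for (size_t n = 1; n <= 8; n++)
            assert(last_limb_aligned(buf + off, n)
                   == last_limb_aligned_by_parity(buf + off, n));
}
```

When the start or end of the operand is not aligned in this sense, the
limb at that end has to be written with a plain 64-bit store.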

Further development needed:

* The lshift code is currently not unrolled.  Unroll it 2x or 4x to
  achieve even better performance (note that the code already runs well
  on 5 of 10 CPUs).

* Make sure lshift does not cause a slowdown for small operands.  If
  needed, add basecase code to counter the slowdown.

* Consider loopmixing for individual CPUs.

* When lshift is finished, write analogous rshift.

* Write copyi/copyd that run well also on core2 (see comments in these
  files; basically split loop into two, using movdqa also for reads).

-- 
Torbjörn