div_qr_1 interface

Tue Oct 22 13:21:39 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > * The code is no win for AMD k10/k8 (although close to 10 c/l might well be
  >   possible)

  I tried replacing one masking op by cmov, as you suggested. We then get
  down to 11.25 c/l on K10. I put this modified version in the k10
  subdirectory, since it was a significant slowdown on some other
  processors.

Nice speedup!  It is not too far from decoder saturated now, I presume.
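
For readers following along, the masking-versus-cmov trade-off can be
sketched in C.  This is only an illustration, not the actual div_qr_1
loop; the function names are made up for this example, and whether the
compiler really emits cmov for the second variant depends on the
compiler and its flags.

```c
#include <stdint.h>

/* Hypothetical helper: add d to r when cond is nonzero, using a mask.
   The mask is all ones or all zeros, so no branch is needed. */
static uint64_t cond_add_mask(uint64_t r, uint64_t d, int cond)
{
    uint64_t mask = -(uint64_t)(cond != 0);
    return r + (mask & d);
}

/* Hypothetical helper: same result via a select.  On x86-64 compilers
   commonly (but not always) turn this into a cmov. */
static uint64_t cond_add_select(uint64_t r, uint64_t d, int cond)
{
    uint64_t t = r + d;   /* computed unconditionally */
    return cond ? t : r;  /* pick one of the two values */
}
```

Both variants are branch-free; which one is faster depends on the
pipeline, which is why a change like this can help on one processor and
hurt on another.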

I think the right place for the file is the k8 subdir, not the k10
subdir.  Their pipelines are almost identical, so the k10 subdir is
used just for code which uses instructions not available on k8.

  Next thing to try is to delay the Q1 store, but that's a bit more work.
  After that, I guess I should try the loop mixer.

I think k8-k10 are losing importance since they have not been made for
several years.  AMD bulldozer/piledriver are not terribly important GMP
targets either, since they have a hopelessly slow integer multiply unit.

The most important targets are sandybridge/ivybridge (similar pipelines)
and haswell.  Less important are nehalem/westmere (very similar
pipelines).  Conroe and the other core2 processors are not important,
except for your laptop.  :-)

I think haswell code could be made a few cycles faster by using the mulx
instruction.  That will avoid copying rax back and forth.
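
As background: the plain mul instruction takes one operand implicitly
in rax and writes the product to rdx:rax, while mulx takes its implicit
operand in rdx, writes the two product halves to registers chosen by the
programmer, and leaves the flags alone.  A minimal C sketch of the
64x64->128 multiply follows.  The function name is made up for this
example, unsigned __int128 is a GCC/Clang extension, and the compiler may
or may not select mulx for it even when BMI2 is enabled.

```c
#include <stdint.h>

/* Hypothetical helper: full 64x64 -> 128-bit product, returned as
   high and low 64-bit limbs.  Needs a compiler with unsigned __int128. */
static void umul_ppmm_sketch(uint64_t *hi, uint64_t *lo,
                             uint64_t u, uint64_t v)
{
    unsigned __int128 p = (unsigned __int128)u * v;
    *hi = (uint64_t)(p >> 64);  /* upper limb of the product */
    *lo = (uint64_t)p;          /* lower limb of the product */
}
```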

-- 
Torbjörn