div_qr_1 interface
Niels Möller
nisse at lysator.liu.se
Tue Oct 22 10:28:21 CEST 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> * The code is no win for AMD k10/k8 (although close to 10 c/l might well be
> possible)
I tried replacing one masking op by cmov, as you suggested. We then get
down to 11.25 c/l on K10. I put this modified version in the k10
subdirectory, since the cmov variant was a significant slowdown on some
other processors.
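
To illustrate the kind of substitution meant here, a minimal C sketch
follows. It is not the actual div_qr_1 assembly loop; the function names
cond_sub_mask and cond_sub_select are hypothetical, and whether a compiler
really emits cmov for the second form is an assumption that depends on the
compiler and flags.

```c
#include <stdint.h>

/* Masking form: build an all-ones or all-zeros mask from the comparison,
   then subtract (mask & d).  No branch, but the mask sits on the
   dependency chain as extra ALU operations. */
static uint64_t cond_sub_mask(uint64_t r, uint64_t d)
{
    uint64_t mask = -(uint64_t)(r >= d);   /* ~0 if r >= d, else 0 */
    return r - (mask & d);
}

/* Select form: compute both candidates and pick one.  On x86-64 a
   compiler will commonly turn the ternary into cmp + cmov, though
   that is not guaranteed. */
static uint64_t cond_sub_select(uint64_t r, uint64_t d)
{
    uint64_t t = r - d;                    /* wraps harmlessly if r < d */
    return (r >= d) ? t : r;
}
```

Both functions return the same value for all inputs; which one is faster
depends on the relative latency and throughput of cmov versus the mask
sequence on the processor in question, which fits with the variant helping
on K10 and hurting elsewhere.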
Next thing to try is to delay the Q1 store, but that's a bit more work.
After that, I guess I should try the loop mixer.
I benchmarked the code on the k8, k10, core2, sandybridge, nehalem and
nano machines. I couldn't log in to haswell and piledriver.
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.