div_qr_1 interface

Niels Möller nisse at lysator.liu.se
Tue Oct 22 10:28:21 CEST 2013


Torbjorn Granlund <tg at gmplib.org> writes:

> * The code is no win for AMD k10/k8 (although close to 10 c/l might well be
>   possible)

I tried replacing one masking op with cmov, as you suggested. That gets
us down to 11.25 c/l on K10. I put the modified version in the k10
subdirectory, since the change was a significant slowdown on some other
processors.
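
For readers unfamiliar with the trade-off, here is a rough C sketch of the
two forms. It is an illustration only, not the actual assembly loop: the
function names and the add-back step are hypothetical, and whether the
compiler emits cmov for the second form is an assumption that depends on
the compiler and flags.

```c
#include <stdint.h>

/* Masking form: turn the condition into an all-zero or all-ones mask,
   then AND it with the value to be conditionally added.
   (Hypothetical helper, not taken from the GMP sources.) */
static uint64_t adjust_mask(uint64_t r, uint64_t d, int cond)
{
    uint64_t mask = -(uint64_t)(cond != 0);  /* 0 or 0xFFFFFFFFFFFFFFFF */
    return r + (mask & d);
}

/* Select form: compute both candidates and pick one. On x86-64 a
   compiler may turn this select into cmov, but that is not guaranteed. */
static uint64_t adjust_select(uint64_t r, uint64_t d, int cond)
{
    uint64_t t = r + d;  /* unsigned arithmetic wraps modulo 2^64 */
    return cond ? t : r;
}
```

Both functions return the same value for every input. They differ only in
the instruction sequence and dependency chain they tend to produce, which
would explain why the same change can help on one microarchitecture and
hurt on another.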

Next thing to try is to delay the Q1 store, but that's a bit more work.
After that, I guess I should try the loop mixer.

I benchmarked the code on the k8, k10, core2, sandybridge, nehalem and
nano machines. I couldn't log in to haswell and piledriver.

/Niels


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

