[PATCH] Rewrite T3/T4 {add,sub}_n.asm
Torbjorn Granlund
tg at gmplib.org
Thu Apr 4 23:50:03 CEST 2013
David Miller <davem at davemloft.net> writes:

> This meets all of the theoretical performance goals we mentioned
> the other day.
Good!
> T3 seems to be much more sensitive to the loop alignment than T4 is.
> For example, if I take out the ALIGN(16) and the annulling branch
> from add_n.asm, T3 takes an extra cycle to execute (thus 8.5 c/l).
I put ALIGN(32) in mul_1.asm and aormul_2.asm. Should these rather be
ALIGN(16), for a start?
> I might want to come back to this and see if we can instead align the
> whole function and either end up with the loop perfectly aligned
> without any changes or insert one or two nops as necessary to achieve
> that. It'll be cheaper, either way, than the unconditional annulled
> branch.
That's probably a good idea. Indeed, I think we do this for a few x86
functions.
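To make the two approaches concrete, here is an illustrative fragment in GMP's m4 asm style. This is only a sketch; the label names, padding counts, and surrounding code are hypothetical and not taken from the actual T3/T4 add_n.asm:

```asm
C Current scheme: pad up to the boundary and hop over the padding
C with an unconditional annulled branch (",a" suppresses execution
C of the delay-slot instruction).
	ba,a	L(top)
	ALIGN(16)
L(top):
	C ... unrolled add/sub loop body ...

C Proposed alternative: align the function entry itself, then insert
C plain nops before the loop so L(top) lands on the boundary without
C any branch being executed on entry.
	nop
	nop
L(top):
	C ... unrolled add/sub loop body ...
```

The second form trades one or two one-cycle nops, executed once per call, for the annulled branch that every call currently pays on its way into the loop.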
> Validated with "make check" and "try".
Pushed, after having remapped g2 to g5 for the benefit of my emulation
code. (Yes, I should fix that...)
(Also updated gmplib.org/devel/asm.html.)
--
Torbjörn