[PATCH] Rewrite T3/T4 {add,sub}_n.asm

Torbjorn Granlund tg at gmplib.org
Thu Apr 4 23:50:03 CEST 2013


David Miller <davem at davemloft.net> writes:

  This meets all of the theoretical performance goals we mentioned
  the other day.

Good!
  
  T3 seems to be much more sensitive to the loop alignment than T4 is.
  For example, if I take out the ALIGN(16) and the annulling branch
  from add_n.asm, T3 takes an extra cycle to execute (thus 8.5 c/l)
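[For readers following the thread: the pattern being discussed looks roughly like the sketch below, in GMP's m4 asm style. This is illustrative only, not the shipped add_n.asm; the label, registers, and loop body are placeholders.]

```asm
	C Sketch only -- not the actual add_n.asm code.
	C ALIGN(16) pads with nops so L(top) starts on a 16-byte
	C boundary; the ",a" (annul) branch jumps over that padding
	C without executing its delay slot.
	ALIGN(16)
	b,a	L(top)		C annulled branch into the loop
L(top):
	C ... loop body: load limbs, add with carry, store result ...
	brgz,pt	n, L(top)	C loop while limbs remain (illustrative)
	 add	up, 8, up	C delay slot: advance source pointer
```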
  
I put ALIGN(32) in mul_1.asm and aormul_2.asm.  Should these be
ALIGN(16) for a start?

  I might want to come back to this and see if we can instead align the
  whole function and either end up with the loop perfectly aligned
  without any changes or insert one or two nops as necessary to achieve
  that.  It'll be cheaper, either way, than the unconditional annulled
  branch.
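[The alternative David describes would look something like this sketch: align the function entry itself, then count the prologue instructions and pad with explicit nops so the loop top lands aligned with no branch at all. Again purely illustrative; the macro usage and instruction counts are assumptions, not the real file.]

```asm
	C Sketch only -- align the whole function so the loop is
	C aligned "for free", instead of branching over padding.
	ALIGN(32)
PROLOGUE(mpn_add_n)
	C ... prologue instructions (count them by hand) ...
	nop			C zero, one, or two nops as needed so
	nop			C that L(top) falls on a 16-byte boundary
L(top):
	C ... loop body as before ...
EPILOGUE()
```

The saving is that the nops execute once at entry (or never, if the prologue length already works out), whereas the annulled branch sits in the fetch stream on every call.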
  
That's probably a good idea.  Indeed, I think we do this for a few x86
functions.

  Validated with "make check" and "try".
  
Pushed, after having remapped g2 to g5 for the benefit of my emulation
code.  (Yes, I should fix that...)

(Also updated gmplib.org/devel/asm.html.)

-- 
Torbjörn
