New T3/T4 code batch

Thu Apr 4 13:11:27 CEST 2013

David Miller <davem at davemloft.net> writes:

  > First mul_1, renamed again, now encoding the load scheduling.  Only the
  > 6c variant is new.  Please time it.  If it doesn't run at 3 c/l, then
  > there are 2 simple things to try, indicated in a comment.

  This gets the expected 3 cycles per limb on T4, and T3 gets
  23 cycles per limb.

Good.  There is method in the madness...

  Numbers are (this is speed -C output):

                        T3      T4
  mpn_modexact_1c_odd   30      26
  mpn_mod_1s_4          30      4
  mpn_bdiv_dbm1c        25      4

Thanks.

  I guess that testing mpn_mod_1s_4 covers things like mpn_invert_limb?

Yes, but not the whole range.  I have a separate test program (not
checked in for some reason).

  I get bus errors in mpn_invert_limb during the testsuite run.  It
  looks like approx_tab is only byte aligned, so the lduh gets a
  SIGBUS.  Adding an "ALIGN(2)" above the table fixes that problem
  for me.

Oops.  I was lucky with alignment in my testing.
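
For the record, the failure mode can be sketched in C.  This is only an
illustration, not the GMP code: is_aligned and load_u16_any are made-up
helper names, and the memcpy load is just one portable way to read a
halfword from an address that might be odd.  A direct halfword load
(like lduh) needs the address to be a multiple of 2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p is a multiple of a bytes. */
static int is_aligned(const void *p, size_t a)
{
    return ((uintptr_t)p % a) == 0;
}

/* Hypothetical helper: read 16 bits from any address, aligned or not.
   memcpy has no alignment requirement, so this cannot fault the way a
   direct halfword load from an odd address can on SPARC. */
static uint16_t load_u16_any(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

The ALIGN(2) fix is the cheaper route here, since it keeps the single
halfword load in the hot path.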

  > Could you also please time the current copyi and copyd?

  copyi and copyd both get 2 cycles/limb on T4 and 6 cycles/limb
  on T3.

OK, so that's already L1d port saturation for T4.

  > 5. Write the various addlsh, sublsh, rshadd, rshsub functions.  Again,
  >    2-way unrolling should be adequate in most cases.  An exception is
  >    addlsh1_n which should be 4-way, using two chains of addxccc, making
  >    heavy use of carry flag register renaming, ideally reaching 3.25 c/l.
  >    We could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l.
  >    All other functions in this group would need sllx,srlx,or for
  >    shifting, adding 1 c/l to the add_n speed (since that was not
  >    issue-constrained...) and 1.5 c/l to the sub_n speed.

  I have a set of addlsh* I was working on already, so I can handle
  these as well.

I'd write just the general count addlsh_n/sublsh_n/rsblsh_n if I were
you.  I.e., use sllx+srlx+or and a chain of addxccc.  Special code for
count 1 could save almost a cycle for wide unrolling, but is hairy.  (I
wrote the start of a 3.125 addlsh1_n, using 8-way unrolling, which I
will not finish now.)
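
As a reference for what the general-count code computes, here is a
minimal C sketch.  It assumes 64-bit limbs and 0 < cnt < 64; limb_t and
ref_addlsh_n are names made up for this example, not GMP's.  The
comments mark which step corresponds to sllx, srlx, or and the carry
chain in the asm.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumed 64-bit limb, stand-in for mp_limb_t */

/* rp[] = up[] + (vp[] << cnt), for 0 < cnt < 64.
   Returns the bits carried out of the top limb. */
limb_t ref_addlsh_n(limb_t *rp, const limb_t *up, const limb_t *vp,
                    size_t n, unsigned cnt)
{
    limb_t shifted_in = 0;   /* high bits of the previous vp limb */
    limb_t cy = 0;           /* carry from the previous addition */

    for (size_t i = 0; i < n; i++) {
        limb_t v = vp[i];
        limb_t s = (v << cnt) | shifted_in;   /* sllx, then or */
        shifted_in = v >> (64 - cnt);         /* srlx */

        limb_t sum = up[i] + s;               /* add with carry-in */
        limb_t c1 = sum < s;
        limb_t r = sum + cy;
        limb_t c2 = r < sum;

        rp[i] = r;
        cy = c1 | c2;                         /* at most one can be set */
    }
    return shifted_in + cy;
}
```

sublsh_n and rsblsh_n follow the same pattern with a borrow chain in
place of the carry chain.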

  We also at some point should look into ways to potentially use mpmul.
  I was in the process of writing a test program that would measure its
  cost including all of the loads, stores, setup, and teardown so that
  we can make a more accurate analysis in this area in the future.

Sure.  It will be faster for operands {up,un} x {vp,vn} where un >= vn
when vn > SOME_THRESHOLD.  But even then, the best strategy will
sometimes be to pad the {vp,vn} operand, sometimes to do a {up,vn} x {vp,vn}
slice first, then do {up,un-vn} x {vp,vn} using a discrete mul_basecase.
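
The slicing variant can be sketched in portable C as follows.  This is
only an illustration under stated assumptions: limbs are 32 bits so the
products fit in uint64_t, MAX_LIMBS is an arbitrary bound for the
temporary, and the balanced {up,vn} x {vp,vn} slice is computed with the
same schoolbook routine where the real code would presumably call
mpmul.  mul_sliced is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t limb_t;   /* assumed 32-bit limbs, for portability */
#define MAX_LIMBS 64       /* arbitrary bound for this sketch */

/* Schoolbook multiply: rp[0 .. un+vn) = {up,un} * {vp,vn}. */
static void mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                         const limb_t *vp, size_t vn)
{
    memset(rp, 0, (un + vn) * sizeof *rp);
    for (size_t j = 0; j < vn; j++) {
        uint64_t cy = 0;
        for (size_t i = 0; i < un; i++) {
            uint64_t t = (uint64_t)up[i] * vp[j] + rp[i + j] + cy;
            rp[i + j] = (limb_t)t;
            cy = t >> 32;
        }
        rp[j + un] = (limb_t)cy;
    }
}

/* rp[0 .. un+vn) = {up,un} * {vp,vn}, un >= vn, done as a balanced
   vn x vn slice plus an (un-vn) x vn remainder.  Returns 0 on success,
   -1 if the sizes are outside what this sketch handles. */
static int mul_sliced(limb_t *rp, const limb_t *up, size_t un,
                      const limb_t *vp, size_t vn)
{
    limb_t hi[MAX_LIMBS];

    if (un < vn || un > MAX_LIMBS)
        return -1;

    /* Balanced slice; stand-in for the hypothetical mpmul path. */
    mul_basecase(rp, up, vn, vp, vn);
    memset(rp + 2 * vn, 0, (un - vn) * sizeof *rp);
    if (un == vn)
        return 0;

    /* Remaining high part of up times vp: un limbs in total. */
    mul_basecase(hi, up + vn, un - vn, vp, vn);

    /* Add it in at limb offset vn; the final carry is always zero. */
    uint64_t cy = 0;
    for (size_t i = 0; i < un; i++) {
        uint64_t t = (uint64_t)rp[vn + i] + hi[i] + cy;
        rp[vn + i] = (limb_t)t;
        cy = t >> 32;
    }
    return 0;
}
```

Padding {vp,vn} instead would trade the extra add pass for multiplying
by zero limbs, which is why the better choice depends on the sizes.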

I expect sqr_basecase to never benefit from mpmul.

-- 
Torbjörn