New T3/T4 code batch
Torbjorn Granlund
tg at gmplib.org
Thu Apr 4 13:11:27 CEST 2013
David Miller <davem at davemloft.net> writes:
> First mul_1, renamed again, now encoding the load scheduling. Only the
> 6c variant is new. Please time it. If it doesn't run at 3 c/l, then
> there are 2 simple things to try, indicated in a comment.
This gets the expected 3 cycles per limb on T4, and T3 gets
23 cycles per limb.
Good. There is method in the madness...
Numbers are (this is speed -C output):
                        T3    T4
mpn_modexact_1c_odd     30    26
mpn_mod_1s_4            30     4
mpn_bdiv_dbm1c          25     4
Thanks.
I guess that testing mpn_mod_1s_4 covers things like mpn_invert_limb?
Yes, but not the whole range. I have a separate test program (not
checked in for some reason).
I get bus errors in mpn_invert_limb during the testsuite run. It
looks like approx_tab is only byte aligned, so the lduh gets a
SIGBUS. Adding an "ALIGN(2)" above the table fixes that problem
for me.
Oops. I was lucky with alignment in my testing.
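For readers following along, here is a small C sketch of the alignment issue discussed above. It is only an illustration, not the actual assembly fix: the table name demo_tab and its contents are made up (they are not GMP's approx_tab), and is_aligned and load_u16 are hypothetical helpers. A halfword load such as lduh needs a 2-byte-aligned address, so a table of 16-bit entries emitted as raw bytes has to be given that alignment explicitly.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table of 16-bit entries stored as raw bytes.  The values are
   placeholders.  Without alignas, a byte array carries no 2-byte alignment
   guarantee, which is the situation that made lduh fault. */
alignas(uint16_t) const unsigned char demo_tab[8] = {
    0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08
};

/* Nonzero if p is a multiple of align bytes. */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0);
    return ((uintptr_t)p % align) == 0;
}

/* Alignment-agnostic 16-bit load: memcpy makes no alignment assumption,
   so this is safe even for odd addresses (byte order is the host's). */
uint16_t load_u16(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

In assembly the cheap fix is the one described above, aligning the table itself; the memcpy-style load is only what portable C would fall back on when alignment cannot be guaranteed.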
> Could you also please time the current copyi and copyd?
copyi and copyd both get 2 cycles/limb on T4 and 6 cycles/limb
on T3.
OK, so that's already L1d port saturation for T4.
> 5. Write the various addlsh, sublsh, rshadd, rshsub functions. Again,
> 2-way unrolling should be adequate in most cases. An exception is
> addlsh1_n which should be 4-way, using two chains of addxccc, making
> heavy use of carry flag register renaming, ideally reaching 3.25 c/l.
> We could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l.
> All other functions in this group would need sllx,srlx,or for
> shifting, adding 1 c/l to the add_n speed (since that was not
> issue-constrained...) and 1.5 c/l to the sub_n speed.
I have a set of addlsh* I was working on already, so I can handle
these as well.
I'd write just the general count addlsh_n/sublsh_n/rsblsh_n if I were
you. I.e., use sllx+srlx+or and a chain of addxccc. Special code for
count 1 could save almost a cycle for wide unrolling, but is hairy. (I
wrote the start of a 3.125 addlsh1_n, using 8-way unrolling, which I
will not finish now.)
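To make the general-count approach concrete, here is a plain C reference sketch of what an addlsh_n computes, rp = up + (vp << cnt). It is not the SPARC code under discussion: limb_t and ref_addlsh_n are hypothetical names, a 64-bit limb is assumed, and the comments only mark where the sllx, srlx, or and the carry chain would appear in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for a 64-bit mp_limb_t */

/* rp[0..n) = up[0..n) + (vp[0..n) << cnt), for 1 <= cnt < 64.
   Returns the part that does not fit in n limbs: the bits shifted out of
   the top v limb plus the final carry. */
limb_t ref_addlsh_n(limb_t *rp, const limb_t *up, const limb_t *vp,
                    size_t n, unsigned cnt)
{
    assert(cnt >= 1 && cnt < 64);
    limb_t shifted_in = 0;   /* high bits carried over from the previous v limb */
    limb_t carry = 0;        /* carry of the addition chain */

    for (size_t i = 0; i < n; i++) {
        limb_t v = vp[i];
        limb_t s = (v << cnt) | shifted_in;   /* sllx, then or */
        shifted_in = v >> (64 - cnt);         /* srlx */

        limb_t t = up[i] + s;                 /* add with carry-in, in two steps */
        limb_t c1 = t < s;
        limb_t r = t + carry;
        limb_t c2 = r < t;
        rp[i] = r;
        carry = c1 | c2;                      /* both cannot be set at once */
    }
    return shifted_in + carry;
}
```

In assembly the two-step add and the carry comparisons would collapse into one carry-propagating add per limb, which is why the shifting instructions dominate the extra cost.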
We also at some point should look into ways to potentially use mpmul.
I was in the process of writing a test program that would measure its
cost including all of the loads, stores, setup, and teardown so that
we can make more accurate analysis in this area in the future.
Sure. It will be faster for operands {up,un} x {vp,vn} where un >= vn
when vn > SOME_THRESHOLD. But even then, the best strategy will
sometimes be to pad the {vp,vn} operand, sometimes to do a {up,vn} x {vp,vn}
slice first, then do {up,un-vn} x {vp,vn} using a discrete mul_basecase.
I expect sqr_basecase to never benefit from mpmul.
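A rough C sketch of the slicing strategy, under several assumptions: 32-bit limbs for simplicity, a software mul_balanced standing in for whatever an mpmul-based balanced multiply would be, and the second product taken over the high un-vn limbs of u. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t limb_t;   /* small limbs keep the sketch portable */

/* Schoolbook product: rp[0..un+vn) = up[0..un) * vp[0..vn). */
void mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                  const limb_t *vp, size_t vn)
{
    memset(rp, 0, (un + vn) * sizeof *rp);
    for (size_t j = 0; j < vn; j++) {
        uint64_t carry = 0;
        for (size_t i = 0; i < un; i++) {
            uint64_t t = (uint64_t)up[i] * vp[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t)t;
            carry = t >> 32;
        }
        rp[j + un] = (limb_t)carry;
    }
}

/* Placeholder for a balanced n x n multiply; software only in this sketch. */
void mul_balanced(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    mul_basecase(rp, up, n, vp, n);
}

/* rp[0..un+vn) = u * v with un >= vn >= 1, using scratch tp[0..un).
   Step 1: balanced vn x vn slice on the low limbs of u.
   Step 2: remaining (un-vn) x vn product, added in at limb offset vn. */
void mul_sliced(limb_t *rp, const limb_t *up, size_t un,
                const limb_t *vp, size_t vn, limb_t *tp)
{
    assert(un >= vn && vn >= 1);
    mul_balanced(rp, up, vp, vn);            /* fills rp[0..2*vn) */
    if (un == vn)
        return;

    mul_basecase(tp, up + vn, un - vn, vp, vn);   /* un limbs in tp */

    limb_t carry = 0;
    for (size_t i = 0; i < un; i++) {
        /* Above rp[2*vn) nothing has been written yet, so treat it as 0. */
        uint64_t lo = (i < vn) ? rp[vn + i] : 0;
        uint64_t s = lo + tp[i] + carry;
        rp[vn + i] = (limb_t)s;
        carry = (limb_t)(s >> 32);
    }
    assert(carry == 0);   /* the full product fits in un+vn limbs */
}
```

Whether slicing beats padding would depend on the measured setup and teardown cost of the balanced multiply, which is what the test program mentioned above is meant to establish.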
--
Torbjörn