New T3/T4 code batch

Thu Apr 4 03:45:48 CEST 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Wed, 03 Apr 2013 21:02:05 +0200

[ Note that all of these timings are based upon what you checked into
  the tree, rather than what was in the attachments. ]

> First mul_1, renamed again, now encoding the load scheduling.  Only the
> 6c variant is new.  Please time it.  If it doesn't run at 3 c/l, then
> there are 2 simple things to try, indicated in a comment.

This gets the expected 3 cycles per limb on T4, and T3 gets
23 cycles per limb.

> This is probably the same code as before for mul_2 and addmul_2.  I
> intend to check it in now.  We really ought to trim the 0.25 c/l at some
> point, it is a 7% speedup after all.

mul_2 runs at 22.5 cycles/limb on T3 and 3.25 cycles/limb on T4.

addmul_2 runs at 23.5 cycles/limb on T3 and 3.75 cycles/limb on T4.
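
As a rough check of the "7%" figure, the sketch below computes the
relative time saved when 0.25 c/l is trimmed from the T4 timings above.
It assumes the targets are 3.0 c/l for mul_2 and 3.5 c/l for addmul_2;
the helper name percent_saved is made up for this illustration.

```c
#include <assert.h>

/* Relative time saved, in percent, when a loop goes from old_cpl to
   new_cpl cycles per limb.  Hypothetical helper, not part of GMP.  */
static double
percent_saved (double old_cpl, double new_cpl)
{
  assert (old_cpl > 0.0);
  return 100.0 * (old_cpl - new_cpl) / old_cpl;
}
```

With the T4 numbers, percent_saved (3.75, 3.5) is about 6.7 and
percent_saved (3.25, 3.0) is about 7.7, so "7%" is a fair summary for
both functions.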

> The rest are division asm functions, updated to avoid constants in the
> impldep insns.  I suppose these are ready for check-in once you have
> time to test them on real sand.  It is enough to time them as
> correctness test, I have made sure they run properly.  (Incidentally,
> not every function here is supported by tests/devel/try.c.)
> 
> I expect all of these, except mod_1_4, to suffer from the huge mul
> delay.

Numbers are (this is speed -C output):

                        T3      T4
mpn_modexact_1c_odd     30      26
mpn_mod_1s_4            30      4
mpn_bdiv_dbm1c          25      4

I guess that testing mpn_mod_1s_4 covers things like mpn_invert_limb?

I get bus errors in mpn_invert_limb during the testsuite run.  It
looks like approx_tab is only byte aligned, so the lduh gets a
SIGBUS.  Adding an "ALIGN(2)" above the table fixes that problem
for me.
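
To illustrate the alignment requirement in C terms (a sketch only, the
real table lives in the asm source): a 16-bit load such as lduh needs an
address that is a multiple of 2, which a byte-aligned table does not
guarantee.  The names is_aligned and demo_tab are invented for this
example, and _Alignas stands in for the ALIGN(2) directive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of align, which must be a power of two.  */
static int
is_aligned (const void *p, size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return ((uintptr_t) p & (align - 1)) == 0;
}

/* Hypothetical stand-in for a table that is read with 16-bit loads.
   Without the _Alignas, a byte array may start at an odd address.  */
static _Alignas (2) const unsigned char demo_tab[8] = { 0 };
```

Here is_aligned (demo_tab, 2) holds, while demo_tab + 1 is the kind of
odd address that would trap on a 16-bit load on SPARC.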

> Could you also please time the current copyi and copyd?

copyi and copyd both get 2 cycles/limb on T4 and 6 cycles/limb
on T3.

> And then gcd_1, using 'tune/speed -CD -s32-64 -t32 mpn_gcd_1'?

T4 gets 6 cycles, and T3 gets 10 cycles.

> 3. Write new new add_n, aiming at 3 c/l using 2-way unrolling and the
>    multi-pointer trick.  The code would have just 10 insns in the loop,
>    and be cache port rather than decode/issue constrained.
> 
> 4. Write new sub_n, aiming at 3 c/l using 2-way unrolling and the
>    multi-pointer trick.  The code would have 12 insns in the loop,
>    and hit both the cache port and issue bandwidth.

I can work on these.
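
For reference, here is a portable C sketch of what a 2-way unrolled
add_n computes.  It shows only the arithmetic and the unrolling shape,
not the multi-pointer trick or any T4 scheduling, and limb_t and
ref_add_n are names made up for the example (64-bit limbs assumed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;	/* hypothetical stand-in for mp_limb_t */

/* rp[] = up[] + vp[] over n limbs, least significant limb first.
   Returns the carry out (0 or 1).  */
static limb_t
ref_add_n (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t cy = 0;
  size_t i = 0;

  assert (n >= 1);

  if (n & 1)			/* odd n: peel one limb first */
    {
      limb_t s = up[0] + vp[0];
      cy = s < up[0];
      rp[0] = s;
      i = 1;
    }
  for (; i < n; i += 2)		/* two limbs per iteration */
    {
      limb_t s0 = up[i] + vp[i];
      limb_t c0 = s0 < up[i];
      limb_t r0 = s0 + cy;
      c0 |= r0 < s0;

      limb_t s1 = up[i + 1] + vp[i + 1];
      limb_t c1 = s1 < up[i + 1];
      limb_t r1 = s1 + c0;
      c1 |= r1 < s1;

      rp[i] = r0;
      rp[i + 1] = r1;
      cy = c1;
    }
  return cy;
}
```

A sub_n sketch would look the same with borrows in place of carries.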

> 5. Write the various addlsh, sublsh, rshadd, rshsub functions.  Again,
>    2-way unrolling should be adequate in most cases.  An exception is
>    addlsh1_n which should be 4-way, using two chains of addxccc, making
>    heavy use of carry flag register renaming, ideally reaching 3.25 c/l.
>    We could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l.
>    All other functions in this group would need sllx,srlx,or for
>    shifting, adding 1 c/l to the add_n speed (since that was not
>    issue-constrained...) and 1.5 c/l to the sub_n speed.

I have a set of addlsh* I was working on already, so I can handle
these as well.
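
As a plain C sketch of the addlsh1_n operation (rp = up + 2*vp), with
the shift folded into the add one limb at a time: this is only meant to
pin down the arithmetic, not to suggest how the addxccc chains would be
laid out.  limb_t and ref_addlsh1_n are invented names, and 64-bit limbs
are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;	/* hypothetical stand-in for mp_limb_t */

/* rp[] = up[] + 2 * vp[] over n limbs, least significant limb first.
   Returns the overflow out of the top limb, a value from 0 to 2.  */
static limb_t
ref_addlsh1_n (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t cy = 0;		/* carry from the addition, 0 or 1 */
  limb_t hi = 0;		/* bit shifted out of the previous vp limb */
  size_t i;

  assert (n >= 1);

  for (i = 0; i < n; i++)
    {
      limb_t sh = (vp[i] << 1) | hi;	/* shifted limb of 2 * vp */
      hi = vp[i] >> 63;
      limb_t s = up[i] + sh;
      limb_t c = s < sh;
      limb_t r = s + cy;
      c |= r < s;
      rp[i] = r;
      cy = c;
    }
  return cy + hi;
}
```

For example, with n = 1 and both inputs equal to 2^64 - 1, the result
limb is 2^64 - 3 and the return value is 2.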

At some point we should also look into ways to potentially use mpmul.
I was in the process of writing a test program that would measure its
cost, including all of the loads, stores, setup, and teardown, so that
we can make a more accurate analysis in this area in the future.

Thanks!