New T3/T4 code batch

Wed Apr 3 21:02:05 CEST 2013

David,

First mul_1, renamed again, with the names now encoding the load
scheduling.  Only the 6c variant is new.  Please time it.  If it doesn't
run at 3 c/l, then there are two simple things to try, indicated in a
comment.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mul_1-3c.asm
Type: application/octet-stream
Size: 3057 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mul_1-6c.asm
Type: application/octet-stream
Size: 4042 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mul_1-8c.asm
Type: application/octet-stream
Size: 3700 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0011.obj>
-------------- next part --------------

This is probably the same code as before for mul_2 and addmul_2.  I
intend to check it in now.  We really ought to trim the 0.25 c/l at some
point; it is a 7% speedup after all.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-aormul_2.asm
Type: application/octet-stream
Size: 5580 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0012.obj>
-------------- next part --------------

The rest are division asm functions, updated to avoid constants in the
impldep insns.  I suppose these are ready for check-in once you have
time to test them on real sand.  It is enough to time them as a
correctness test; I have made sure they run properly.  (Incidentally,
not every function here is supported by tests/devel/try.c.)

I expect all of these, except mod_1_4, to suffer from the huge mul
delay.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mode1o.asm
Type: application/octet-stream
Size: 2224 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0013.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mod_1_4.asm
Type: application/octet-stream
Size: 4602 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0014.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-bdiv_dbm1c.asm
Type: application/octet-stream
Size: 1552 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0015.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-dive_1.asm
Type: application/octet-stream
Size: 2976 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0016.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-invert_limb.asm
Type: application/octet-stream
Size: 3466 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0017.obj>
-------------- next part --------------

Could you also please time the current copyi and copyd?
And then gcd_1, using 'tune/speed -CD -s32-64 -t32 mpn_gcd_1'?

So what remains to be done for T4?  And which ones would you want to
work on?

I'd suggest this prio order:

1. Write new addmul_1, aiming at 4.25 c/l.  It is like mul_1 plus one
   ldx,addxccc pair per limb, and one carry propagating addxc per
   iteration.  I'd suggest 4-way unrolling with single pointers; 2-way
   would strain OoO too much to run well.  We could reach
   ceil((7*k+5)/2)/k cycles/limb for k-way unrolling, so 8-way would be
   10% faster than 4-way.  Feed-in for 8-way would require either a jump
   table, or a binary search.

2. Write new submul_1, aiming at 4.75 c/l, using 4-way unrolling.  We'd
   reach ceil((8*k+5)/2)/k cycles/limb here, or 4.375 c/l for 8-way.

3. Write new add_n, aiming at 3 c/l using 2-way unrolling and the
   multi-pointer trick.  The code would have just 10 insns in the loop,
   and be cache port rather than decode/issue constrained.

4. Write new sub_n, aiming at 3 c/l using 2-way unrolling and the
   multi-pointer trick.  The code would have 12 insns in the loop,
   and hit both the cache port and issue bandwidth.

5. Write the various addlsh, sublsh, rshadd, rshsub functions.  Again,
   2-way unrolling should be adequate in most cases.  An exception is
   addlsh1_n which should be 4-way, using two chains of addxccc, making
   heavy use of carry flag register renaming, ideally reaching 3.25 c/l.
   We could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l.
   All other functions in this group would need sllx,srlx,or for
   shifting, adding 1 c/l to the add_n speed (since that was not
   issue-constrained...) and 1.5 c/l to the sub_n speed.

6. Write addmul_k, k > 2.  At some point, we can go to 2-way unrolling
   without losing speed (perhaps already for k=2, with some accumulation
   rewriting).  At some point, surely no later than for k=8, we could
   skip unrolling altogether.  We could gain 50% general speedup with
   this approach.

7. Write mul_basecase, sqr_basecase, mullo_basecase, redc_1, redc_2...
   These would inline addmul_1, addmul_2, and whatever larger addmul_k
   we've come up with.  Use "overlapped software pipelining".

8. Anything else missing from the T4 column at gmplib.org/devel/asm.html.

-- 
Torbjörn