New T3/T4 code batch
Torbjorn Granlund
tg at gmplib.org
Wed Apr 3 21:02:05 CEST 2013
David,
First mul_1, renamed again, now encoding the load scheduling. Only the
6c variant is new. Please time it. If it doesn't run at 3 c/l, then
there are 2 simple things to try, indicated in a comment.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mul_1-3c.asm
Type: application/octet-stream
Size: 3057 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mul_1-6c.asm
Type: application/octet-stream
Size: 4042 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mul_1-8c.asm
Type: application/octet-stream
Size: 3700 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0011.obj>
-------------- next part --------------
This is probably the same code as before for mul_2 and addmul_2. I
intend to check it in now. We really ought to trim the 0.25 c/l at some
point; it is a 7% speedup after all.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-aormul_2.asm
Type: application/octet-stream
Size: 5580 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0012.obj>
-------------- next part --------------
The rest are division asm functions, updated to avoid constants in the
impldep insns. I suppose these are ready for check-in once you have
time to test them on real sand. Timing them is enough of a correctness
test; I have made sure they run properly. (Incidentally,
not every function here is supported by tests/devel/try.c.)
I expect all of these, except mod_1_4, to suffer from the huge mul
delay.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mode1o.asm
Type: application/octet-stream
Size: 2224 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0013.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-mod_1_4.asm
Type: application/octet-stream
Size: 4602 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0014.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-bdiv_dbm1c.asm
Type: application/octet-stream
Size: 1552 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0015.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-dive_1.asm
Type: application/octet-stream
Size: 2976 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0016.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparct34-invert_limb.asm
Type: application/octet-stream
Size: 3466 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130403/050df39b/attachment-0017.obj>
-------------- next part --------------
Could you also please time the current copyi and copyd?
And then gcd_1, using 'tune/speed -CD -s32-64 -t32 mpn_gcd_1'?
So what remains to be done for T4? And which ones would you want to
work on?
I'd suggest this prio order:
1. Write new addmul_1, aiming at 4.25 c/l. It is like mul_1 plus one
ldx,addxccc pair per limb, and one carry propagating addxc per
iteration. I'd suggest 4-way unrolling with single pointers; 2-way
would strain OoO too much to run well. We could reach
ceil((7*k+5)/2)/k cycles/limb for k-way unrolling, so 8-way would be
10% faster than 4-way. Feed-in for 8-way would require either a jump
table, or a binary search.
2. Write new submul_1, aiming at 4.75 c/l, using 4-way unrolling. We'd
reach ceil((8*k+5)/2)/k cycles/limb here, or 4.375 c/l for 8-way.
3. Write new add_n, aiming at 3 c/l using 2-way unrolling and the
multi-pointer trick. The code would have just 10 insns in the loop,
and be cache port rather than decode/issue constrained.
4. Write new sub_n, aiming at 3 c/l using 2-way unrolling and the
multi-pointer trick. The code would have 12 insns in the loop,
and hit both the cache port and issue bandwidth.
5. Write the various addlsh, sublsh, rshadd, rshsub functions. Again,
2-way unrolling should be adequate in most cases. An exception is
addlsh1_n which should be 4-way, using two chains of addxccc, making
heavy use of carry flag register renaming, ideally reaching 3.25 c/l.
We could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l.
All other functions in this group would need sllx,srlx,or for
shifting, adding 1 c/l to the add_n speed (since that was not
issue-constrained...) and 1.5 c/l to the sub_n speed.
6. Write addmul_k, k > 2. At some point, we can go to 2-way unrolling
without losing speed (perhaps already for k=2, with some accumulation
rewriting). At some point, surely no later than for k=8, we could
skip unrolling altogether. We could gain a 50% general speedup with
this approach.
7. Write mul_basecase, sqr_basecase, mullo_basecase, redc_1, redc_2...
These would inline addmul_1, addmul_2, and whatever larger addmul_k
we've come up with. Use "overlapped software pipelining".
8. Anything else missing from the T4 column at gmplib.org/devel/asm.html.
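
The unrolling estimates in items 1 and 2 can be checked with a few lines of
Python. This is only a sketch: reading ceil((a*k+b)/2)/k as "a insns per limb
plus b insns per iteration, issued two per cycle" is an assumption, and the
names cycles_per_limb, per_limb and per_iter are made up for illustration.

```python
from fractions import Fraction

def cycles_per_limb(per_limb, per_iter, k):
    """Evaluate ceil((per_limb*k + per_iter) / 2) / k for k-way unrolling.

    per_limb and per_iter are hypothetical names for the two constants in
    the formula (7 and 5 for addmul_1, 8 and 5 for submul_1).
    """
    insns = per_limb * k + per_iter
    loop_cycles = (insns + 1) // 2      # integer ceil(insns / 2)
    return Fraction(loop_cycles, k)

for name, per_limb in (("addmul_1", 7), ("submul_1", 8)):
    for k in (4, 8):
        cpl = cycles_per_limb(per_limb, 5, k)
        print(f"{name} {k}-way: {float(cpl):.3f} c/l")

# Ratio of 4-way to 8-way cost for addmul_1 (the "10% faster" remark).
ratio = cycles_per_limb(7, 5, 4) / cycles_per_limb(7, 5, 8)
print(f"addmul_1 4-way/8-way cost ratio: {float(ratio):.3f}")
```

Under this reading, the 4-way values match the 4.25 c/l and 4.75 c/l targets
and the 8-way submul_1 value matches 4.375 c/l.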
--
Torbjörn