T3/T3 mul_2 and addmul_2
Torbjorn Granlund
tg at gmplib.org
Fri Mar 8 21:23:55 CET 2013
David Miller <davem at davemloft.net> writes:
Seems to work fine, here are some speed runs:
davem@patience:~/src/GMP/HG/build-sparc64-ultrasparct4/tune$ ./speed -C -s 32-64 -t 2 mpn_mul_2
overhead 6.06 cycles, precision 10000 units of 3.51e-10 secs, CPU freq 2847.34 MHz
mpn_mul_2
32 7.7564
34 7.5137
36 7.7395
38 7.4933
40 7.7088
42 7.4566
44 7.7251
46 7.4409
48 7.6494
50 7.4467
52 7.6216
54 7.4067
56 7.5927
58 7.3858
60 7.6043
62 7.4039
64 7.5909
davem@patience:~/src/GMP/HG/build-sparc64-ultrasparct4/tune$ ./speed -C -s 32-64 -t 2 mpn_addmul_2
overhead 6.04 cycles, precision 10000 units of 3.51e-10 secs, CPU freq 2847.46 MHz
mpn_addmul_2
32 8.5508
34 8.4938
36 8.6271
38 8.4753
40 8.4031
42 8.4429
44 8.3950
46 8.4700
48 8.3910
50 8.4448
52 8.4181
54 8.4581
56 8.3765
58 8.4279
60 8.3778
62 8.4242
64 8.3734
Thanks!
OK, better speed but not great. These numbers mean that mul_2 runs at
about 3.5 c/l and addmul_2 runs at about 4 c/l. I had hoped for 3 and
3.5 c/l, respectively...
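As a rough check of that arithmetic, here is a small sketch in Python. It assumes that the figures printed by speed -C are cycles per limb of the long operand, and that each pass of mul_2 or addmul_2 consumes two multiplier limbs, so halving gives cycles per limb product. The function name cycles_per_limb_product is made up for this illustration.

```python
# Per-limb figures copied from the two speed runs quoted above
# (sizes 32, 34, ..., 64).
mul_2 = [7.7564, 7.5137, 7.7395, 7.4933, 7.7088, 7.4566, 7.7251, 7.4409,
         7.6494, 7.4467, 7.6216, 7.4067, 7.5927, 7.3858, 7.6043, 7.4039,
         7.5909]
addmul_2 = [8.5508, 8.4938, 8.6271, 8.4753, 8.4031, 8.4429, 8.3950, 8.4700,
            8.3910, 8.4448, 8.4181, 8.4581, 8.3765, 8.4279, 8.3778, 8.4242,
            8.3734]


def cycles_per_limb_product(samples, multiplier_limbs=2):
    """Average the samples and divide by the number of multiplier limbs.

    Assumption: each sample is cycles per limb of the long operand, and one
    pass handles `multiplier_limbs` limbs of the short operand.
    """
    mean = sum(samples) / len(samples)
    return mean / multiplier_limbs


mul_2_cl = cycles_per_limb_product(mul_2)
addmul_2_cl = cycles_per_limb_product(addmul_2)
print(f"mul_2    ~ {mul_2_cl:.2f} c/l")
print(f"addmul_2 ~ {addmul_2_cl:.2f} c/l")
```

The plain averages come out somewhat above the rounded figures quoted in the text, which is expected if some fixed per-call cost is still included in the measurements.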
The pipelining of mulx is not great; the results are used too soon. The
mulx v0 result is used just 3 instructions after it is issued.
The checked in code (still with the bug, will fix soon) is slightly
different, but with the same basic instruction schedule. I moved the
branch point to save some insn in the wind down code.
One might try to move the addcc,addxccc,addxc sequence before each loN
label to either after the mulx,umulxhi at the label, or mixed in with
them. I'd try mulx,addcc,umulxhi,addxccc,addxc and
mulx,umulxhi,addcc,addxccc,addxc.
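To make the data flow of that five-instruction group concrete, here is a sketch in Python that models each instruction as a function on 64-bit values. It is only a model of the arithmetic, not the checked-in loop: the accumulator layout (a0, a1, a2) and the function accumulate are assumptions made for illustration. The point is that mulx and umulxhi only produce values, and the addcc, addxccc, addxc chain only consumes them, so interleaving them differently changes the schedule but not the result.

```python
MASK = (1 << 64) - 1


def mulx(a, b):
    """Low 64 bits of the 64x64-bit product."""
    return (a * b) & MASK


def umulxhi(a, b):
    """High 64 bits of the unsigned 64x64-bit product."""
    return (a * b) >> 64


def addcc(a, b):
    """Add, returning (sum, carry-out)."""
    s = a + b
    return s & MASK, s >> 64


def addxccc(a, b, c):
    """Add with carry-in, returning (sum, carry-out)."""
    s = a + b + c
    return s & MASK, s >> 64


def addxc(a, b, c):
    """Add with carry-in, carry-out not recorded."""
    return (a + b + c) & MASK


def accumulate(acc, u, v):
    """Add the 128-bit product u*v into a three-limb accumulator.

    Hypothetical helper: the real loop keeps these values in registers and
    overlaps several such groups.
    """
    a0, a1, a2 = acc
    lo = mulx(u, v)
    hi = umulxhi(u, v)
    a0, c = addcc(a0, lo)
    a1, c = addxccc(a1, hi, c)
    a2 = addxc(a2, 0, c)
    return a0, a1, a2
```

For example, accumulate((MASK, MASK, 0), MASK, MASK) yields three limbs whose combined value equals the old accumulator plus MASK*MASK.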
If you want to play with this, please start with the checked in code
(you'll need a fresh configure.ac to allow the aormul_2 'multifunc'
name). The first thing to try is its speed compared to the code you
timed above.
For a decode-constrained pipeline with good multiply throughput, we will
need to go to at least addmul_4 as the main multiply workhorse. That
could surely run at < 3 c/l on T4. For sqr_basecase, addmul_2 is
currently the widest primitive we can make use of, so it will still be
important.
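For readers less familiar with how these primitives are used, here is a reference model in Python of a basecase multiply built from addmul_2 passes. The semantics assumed for addmul_2 (add {up,n} times two limbs of vp into n+1 limbs at rp, return the next limb up) are my reading of how the basecase code calls it, and the helpers to_int, to_limbs and this mul_basecase are simplified stand-ins, not GMP's code. A wider addmul_4 would halve the number of passes over rp compared with addmul_2.

```python
B = 64
MASK = (1 << B) - 1


def to_int(limbs):
    """Little-endian limb list to integer."""
    return sum(l << (B * i) for i, l in enumerate(limbs))


def to_limbs(x, n):
    """Integer to n little-endian limbs."""
    return [(x >> (B * i)) & MASK for i in range(n)]


def addmul_2(rp, up, vp):
    """Model: rp[0..n] += up * (vp[0] + vp[1]*2^64); return the top limb.

    Assumed semantics; in basecase use the sum always fits in n+2 limbs.
    """
    n = len(up)
    total = to_int(rp[:n + 1]) + to_int(up) * to_int(vp[:2])
    limbs = to_limbs(total, n + 2)
    rp[:n + 1] = limbs[:n + 1]
    return limbs[n + 1]


def mul_basecase(up, vp):
    """Schoolbook product using one addmul_2 pass per two limbs of vp.

    Simplification: len(vp) must be even. With rp zeroed, the first pass
    behaves like mul_2.
    """
    un, vn = len(up), len(vp)
    assert vn % 2 == 0
    rp = [0] * (un + vn)
    for j in range(0, vn, 2):
        window = rp[j:j + un + 1]
        top = addmul_2(window, up, vp[j:j + 2])
        rp[j:j + un + 1] = window
        rp[j + un + 1] = top
    return rp
```

Calling mul_basecase(u, v) on two limb lists returns len(u) + len(v) limbs whose value is the product of the two inputs.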
--
Torbjörn