T3/T4 mul_2 and addmul_2

Fri Mar 8 21:23:55 CET 2013

David Miller <davem at davemloft.net> writes:

  Seems to work fine, here are some speed runs:

  davem at patience:~/src/GMP/HG/build-sparc64-ultrasparct4/tune$ ./speed -C -s 32-64 -t 2 mpn_mul_2
  overhead 6.06 cycles, precision 10000 units of 3.51e-10 secs, CPU freq 2847.34 MHz
              mpn_mul_2
  32             7.7564
  34             7.5137
  36             7.7395
  38             7.4933
  40             7.7088
  42             7.4566
  44             7.7251
  46             7.4409
  48             7.6494
  50             7.4467
  52             7.6216
  54             7.4067
  56             7.5927
  58             7.3858
  60             7.6043
  62             7.4039
  64             7.5909
  davem at patience:~/src/GMP/HG/build-sparc64-ultrasparct4/tune$ ./speed -C -s 32-64 -t 2 mpn_addmul_2
  overhead 6.04 cycles, precision 10000 units of 3.51e-10 secs, CPU freq 2847.46 MHz
           mpn_addmul_2
  32             8.5508
  34             8.4938
  36             8.6271
  38             8.4753
  40             8.4031
  42             8.4429
  44             8.3950
  46             8.4700
  48             8.3910
  50             8.4448
  52             8.4181
  54             8.4581
  56             8.3765
  58             8.4279
  60             8.3778
  62             8.4242
  64             8.3734

Thanks!

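OK, better speed but not great.  The figures are cycles per limb of the
n-limb operand, and each such limb is multiplied by two limbs, so these
numbers mean that mul_2 runs at about 3.75 c/l and addmul_2 runs at
about 4.2 c/l.  I had hoped for 3 and 3.5 c/l, respectively...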
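
To make the conversion from the speed output to a c/l rate explicit,
here is a minimal C sketch.  The helper name is hypothetical (it is not
part of GMP or of the speed program), and it assumes the -C figures are
cycles per limb of the n-limb operand, so dividing by the number of
multiplier limbs gives cycles per limb product.

```c
#include <assert.h>

/* Hypothetical helper, not part of GMP: convert a `speed -C` figure
   (assumed to be cycles per limb of the n-limb operand) into cycles
   per limb product for a primitive that handles `multiplier_limbs`
   multiplier limbs per pass (2 for mul_2 and addmul_2). */
static double cycles_per_limb_product(double cycles_per_operand_limb,
                                      unsigned multiplier_limbs)
{
    assert(multiplier_limbs > 0);
    return cycles_per_operand_limb / (double) multiplier_limbs;
}
```

For example, a reading of about 7.5 for mpn_mul_2 divided by 2 gives
3.75, and about 8.4 for mpn_addmul_2 gives 4.2.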

The pipelining of mulx is not great, the results are used too soon.  The
mulx v0 result is just 3 instructions from its use.

The checked in code (still with the bug, will fix soon) is slightly
different, but with the same basic instruction schedule.  I moved the
branch point to save some insn in the wind down code.

One might try to move the addcc,addxccc,addxc sequence before each loN
label to either after the mulx,umulxhi at the label, or mixed in with
them.  I'd try mulx,addcc,umulxhi,addxccc,addxc and
mulx,umulxhi,addcc,addxccc,addxc.

If you want to play with this, please start with the checked in code
(you'll need a fresh configure.ac to allow the aormul_2 'multifunc'
name).  The first thing to try is its speed compared to the code you
timed above.

For a decode-constrained pipeline with good multiply throughput we will
need to go to at least addmul_4 as the main multiply workhorse.  That
could surely run at < 3 c/l on T4.  For sqr_basecase, addmul_2 is
currently the widest primitive we can make use of, so it will still be
important.
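
For readers unfamiliar with the primitive, here is a rough portable C
sketch of what an addmul_2-style operation computes in a single pass.
It is an illustration under stated assumptions, not GMP's code: the
name ref_addmul_2, the 32-bit limb type, and the calling convention
(add into n+2 limbs in place, return the carry out) are all
hypothetical and do not try to reproduce the real mpn_addmul_2
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 32-bit limb so that a 64-bit type can hold one product. */
typedef uint32_t limb_t;

/* Hypothetical reference routine (not GMP's interface):
   {rp,n+2} += {up,n} * {vp,2}, returning the carry out of rp[n+1].
   Each up[i] is loaded once and multiplied by both v0 and v1. */
static limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t *vp)
{
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0;   /* pending carry with weight of position i   */
    limb_t c1 = 0;   /* pending carry with weight of position i+1 */

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* Position i: up[i]*v0 + rp[i] + c0 fits in 64 bits. */
        uint64_t t = (uint64_t) up[i] * v0 + rp[i] + c0;
        rp[i] = (limb_t) t;
        /* Position i+1: up[i]*v1 + c1 + high half of t also fits. */
        uint64_t u = (uint64_t) up[i] * v1 + c1 + (limb_t) (t >> 32);
        c0 = (limb_t) u;
        c1 = (limb_t) (u >> 32);
    }
    /* Fold the two pending carry limbs into rp[n] and rp[n+1]. */
    uint64_t t = (uint64_t) rp[n] + c0;
    rp[n] = (limb_t) t;
    uint64_t u = (uint64_t) rp[n + 1] + c1 + (limb_t) (t >> 32);
    rp[n + 1] = (limb_t) u;
    return (limb_t) (u >> 32);
}
```

Compared with two addmul_1 passes, this reads and writes each rp limb
once per two limb products, which is why wider primitives such as
addmul_4 reduce per-product overhead further.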

-- 
Torbjörn