ASM ?

Wed Jun 10 10:40:08 CEST 2009

Jim Langston <Jim.Langston at Sun.COM> writes:

  One question - you indicate that the SPARC asm was done
  from scratch, you would not by chance know from whom ?

I believe I wrote it all.

I targeted the UltraSPARC 1&2 pipeline, and have not updated the code
for the UltraSPARC 3 pipeline.  Every function runs worse or much worse
on UltraSPARC 3.  The UltraSPARC 3 pipeline is slower, but most of GMP's
slowness on US 3 relative to US 2 could be fixed with some rescheduling.

Mind you, UltraSPARC will remain the worst performer of all contemporary
processors.  Perhaps "rock", with its planned umulxhi instruction and
64-bit extended addition and subtraction, could catch up?
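
To illustrate why those two primitives matter for multi-precision code,
here is a minimal portable C sketch of what they compute.  It assumes
that umulxhi returns the upper 64 bits of an unsigned 64x64-bit product
(as the name suggests) and that "extended addition" means add with
carry-in and carry-out.  The names mul_hi64 and addc64 are hypothetical
and appear neither in GMP nor in any SPARC manual.

```c
#include <stdint.h>

/* Hypothetical helper: upper 64 bits of the 128-bit product a * b,
   built from four 32x32->64 partial products.  A single umulxhi-style
   instruction would replace all of this. */
static uint64_t mul_hi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Sum everything that lands in bits 32..63; its upper half is the
       carry into bit 64.  The sum is below 3 * 2^32, so it cannot
       overflow. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + (p2 & 0xffffffffu);

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Hypothetical helper: a + b + cin with carry-out, the operation an
   extended (carry-propagating) 64-bit add would do in one instruction.
   cin must be 0 or 1. */
static uint64_t addc64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    uint64_t s = a + b;
    unsigned c1 = s < a;         /* carry out of a + b */
    uint64_t t = s + cin;
    unsigned c2 = t < s;         /* carry out of adding cin */
    *cout = c1 | c2;             /* at most one of the two can be set */
    return t;
}
```

Without such instructions, each limb product and each carry has to be
pieced together from several operations, as the bodies above show.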

If I were you, this is what I'd do:

1. Reschedule addmul_1 and mul_1 for US 3's 4-cycle fp pipeline.  (The
code is written for the US 1&2 3-cycle fp pipeline.)  Make sure the code
doesn't run worse on US 1&2.

2. Fix the lshift and rshift code.  Perhaps it is enough to remove the
"fanop"?

3. Consider writing mul_basecase and sqr_basecase using inlined addmul_2
as the engine.  Overlap the wind-down code of outer loop iteration k
with the feed-in code of iteration (k+1).  This isn't going to be easy,
but it could give a speedup of 2x for mul_basecase and 3x for
sqr_basecase.  See the x86_64 code for an example, which has everything
except the overlapped wind-down/feed-in.

4. Create a special configuration for T1 and T2 that provides its own
addmul_1, mul_1, addmul_2, and submul_1 code, perhaps simply by grabbing
the C code from generic.
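
For readers unfamiliar with these primitives, the following is a
minimal C sketch of what mul_1, addmul_1 and a schoolbook mul_basecase
compute.  It is not GMP's actual generic code: the limb_t type, the
*_sketch names and the choice of 32-bit limbs (so that a plain uint64_t
holds a double-limb product) are assumptions made for illustration.
Note that it drives the outer loop with addmul_1, one row per pass; the
addmul_2 variant suggested in item 3 would consume two limbs of the
second operand per pass, and is not shown here.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs for portability */

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
static limb_t mul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* product + carry is at most 2^64 - 2^33 + 2^32, so it fits */
        uint64_t t = (uint64_t)up[i] * v + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t addmul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* product + rp[i] + carry is at most 2^64 - 1, so it fits */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
   Requires un >= 1, vn >= 1, and rp must not overlap the inputs. */
static void mul_basecase_sketch(limb_t *rp,
                                const limb_t *up, size_t un,
                                const limb_t *vp, size_t vn)
{
    /* First row initializes the low un+1 limbs of the result. */
    rp[un] = mul_1_sketch(rp, up, un, vp[0]);

    /* Each further row is accumulated one limb position higher. */
    for (size_t j = 1; j < vn; j++)
        rp[un + j] = addmul_1_sketch(rp + j, up, un, vp[j]);
}
```

The inner loops are where an assembly version spends nearly all its
time, which is why the scheduling of addmul_1 and mul_1 dominates
overall multiplication speed at small sizes.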

-- 
Torbjörn