ASM ?
    Torbjorn Granlund 
    tg at gmplib.org
       
    Wed Jun 10 10:40:08 CEST 2009
    
    
  
Jim Langston <Jim.Langston at Sun.COM> writes:
  One question - you indicate that the SPARC asm was done
  from scratch, you would not by chance know from whom ?
  
I believe I wrote it all.
I targeted the UltraSPARC 1&2 pipeline, and have not updated the code
for the UltraSPARC 3 pipeline.  Every function runs worse or much worse
on UltraSPARC 3.  The UltraSPARC 3 pipeline is slower, but most of
GMP's slowness on US 3 relative to US 2 could be fixed with some
rescheduling.
Mind you, UltraSPARC will remain the worst performer of all contemporary
processors.  Perhaps "rock", with its planned umulxhi instruction and
64-bit extended addition and subtraction, could catch up?
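
For context on why those two features matter: the inner step of mul_1
and addmul_1 needs the high half of a 64x64-bit product and a carry
chain across limbs.  A rough C sketch of both primitives follows.  It
assumes that umulxhi returns the upper 64 bits of an unsigned 64x64-bit
product and that the extended add takes and produces a carry.  The
helper names umul_hi and add_carry are made up for illustration.

```c
#include <stdint.h>

/* High 64 bits of an unsigned 64x64-bit product, built from four
   32x32-bit partial products.  Assumption: a umulxhi-style instruction
   would deliver this value in a single operation. */
static uint64_t umul_hi(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1;
    uint64_t p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of everything that lands in bits 32..63; its upper half
       carries into the high word. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* 64-bit add with carry-in (0 or 1) and carry-out, written in plain C.
   Assumption: an extended-add instruction does this in one step. */
static uint64_t add_carry(uint64_t a, uint64_t b, uint64_t cin,
                          uint64_t *cout)
{
    uint64_t s = a + b;
    uint64_t c = s < a;      /* carry from a + b */
    uint64_t t = s + cin;
    c += t < s;              /* carry from adding cin */
    *cout = c;
    return t;
}
```

Without such instructions, each of these costs several integer
operations per limb, or a detour through the fp unit.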
If I were you, this is what I'd do:
1. Reschedule addmul_1 and mul_1 for US 3's 4-cycle fp pipeline.  (The
code is written for the US 1&2 3-cycle fp pipeline.)  Make sure the code
doesn't run worse on US 1&2.
2. Fix the lshift and rshift code.  Perhaps it is enough to remove the
"fanop"?
3. Consider writing mul_basecase and sqr_basecase using inlined addmul_2
as the engine.  Overlap the wind-down code of outer loop iteration k
with the feed-in code of iteration (k+1).  This isn't going to be easy,
but it could give a speedup of 2x for mul_basecase and 3x for
sqr_basecase.  See the x86_64 code for an example, which has everything
except the overlapped wind-down/feed-in.
4. Create a special configuration for T1 and T2, which provides special
addmul_1, mul_1, addmul_2, and submul_1 code, perhaps simply grabbing
the C code from generic.
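
To make items 3 and 4 a bit more concrete, here is a minimal C sketch of
plain-C mul_1, addmul_1 and addmul_2, plus a mul_basecase that uses
addmul_2 as its engine.  It shows only the structure, not the scheduling
or the overlapped wind-down/feed-in.  It assumes 32-bit limbs so that a
uint64_t holds a full product.  The ref_ names, the limb_t type and the
exact calling conventions are made up for illustration and are not
GMP's internal interfaces.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* assumption: 32-bit limbs, for portability */

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t c = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + c;
        rp[i] = (limb_t)t;
        c = t >> 32;
    }
    return (limb_t)c;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t c = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + c;  /* cannot overflow */
        rp[i] = (limb_t)t;
        c = t >> 32;
    }
    return (limb_t)c;
}

/* Adds up[0..n-1] * (vp[0] + vp[1]*2^32) to rp[0..n-1].  Writes
   rp[0..n] (rp[n] is overwritten, not read) and returns limb n+1.
   Each up[i] is loaded once and feeds two products. */
static limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t *vp)
{
    uint64_t v0 = vp[0], v1 = vp[1];
    uint64_t c0 = 0, c1 = 0;  /* pending limbs for positions i and i+1 */
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i];
        uint64_t lo = u * v0 + rp[i] + c0;
        uint64_t hi = u * v1 + (lo >> 32) + c1;
        rp[i] = (limb_t)lo;
        c0 = (uint32_t)hi;
        c1 = hi >> 32;
    }
    rp[n] = (limb_t)c0;
    return (limb_t)c1;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].  Requires un >= 1,
   vn >= 1, and rp not overlapping the inputs. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    rp[un] = ref_mul_1(rp, up, un, vp[0]);
    rp += 1; vp += 1; vn -= 1;
    while (vn >= 2) {          /* two limbs of v per outer iteration */
        rp[un + 1] = ref_addmul_2(rp, up, un, vp);
        rp += 2; vp += 2; vn -= 2;
    }
    if (vn == 1)               /* odd limb left over */
        rp[un] = ref_addmul_1(rp, up, un, vp[0]);
}
```

The point of the addmul_2 loop is that the outer loop runs half as many
times and each limb of u is loaded once per two limbs of v.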
-- 
Torbjörn