tg at gmplib.org
Wed Jun 10 10:40:08 CEST 2009
Jim Langston <Jim.Langston at Sun.COM> writes:
> One question - you indicate that the SPARC asm was done
> from scratch, you would not by chance know from whom?
I believe I wrote it all.
I targeted the UltraSPARC 1&2 pipeline, and have not updated the code
for the UltraSPARC 3 pipeline. Every function runs worse or much worse
on UltraSPARC 3. The UltraSPARC 3 pipeline is slower, but most of
GMP's slowness on US 3 relative to US 2 could be fixed with some rescheduling.
Mind you, UltraSPARC will remain the worst performer of all contemporary
processors. Perhaps "rock", with its planned umulxhi instruction and
64-bit extended addition and subtraction, could catch up?
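
To illustrate why those two operations matter for bignum code, here is a rough C sketch of what each would have to be emulated with when the hardware lacks them. It assumes that umulxhi yields the high 64 bits of a 64x64-bit product and that "extended" addition means add with carry-in and carry-out; the function names sketch_umulhi64 and sketch_addc64 are made up for this example and are not GMP code.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128-bit product a * b,
   built from four 32x32->64 multiplies. */
static uint64_t sketch_umulhi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Sum of the middle column; at most 3 * (2^32 - 1), so no overflow. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Hypothetical helper: *sum = a + b + carry_in, returns the carry-out (0 or 1). */
static unsigned sketch_addc64(uint64_t *sum, uint64_t a, uint64_t b,
                              unsigned carry_in)
{
    uint64_t s = a + b;
    unsigned carry = s < a;      /* carry from a + b */
    uint64_t r = s + carry_in;
    carry += r < s;              /* carry from adding carry_in; both cannot be 1 */
    *sum = r;
    return carry;
}
```

Each of these is a single instruction on processors that have a high-multiply and a 64-bit carry chain, against a dozen or so operations in the emulation above.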
If I were you, this is what I'd do:
1. Reschedule addmul_1 and mul_1 for US 3's 4-cycle fp pipeline. (The
code is written for the US 1&2 3-cycle fp pipeline.) Make sure the code
doesn't run worse on US 1&2.
2. Fix the lshift and rshift code. Perhaps it is enough to remove the
3. Consider writing mul_basecase and sqr_basecase using inlined addmul_2
as the engine. Overlap the wind-down code of outer loop iteration k
with the feed-in code of iteration (k+1). This isn't going to be easy,
but it could give a speedup of 2x for mul_basecase and 3x for
sqr_basecase. See the x86_64 code for an example, which has everything
except the overlapped wind-down/feed-in.
4. Create a special configuration for T1 and T2, which provides special
addmul_1, mul_1, addmul_2, and submul_1 code, perhaps simply grabbing
the C code from generic.
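
For readers unfamiliar with the structure item 3 refers to, the following is a minimal portable C sketch of a basecase multiply driven by an addmul_2-style inner loop. It is not GMP's code: the names sketch_addmul_1, sketch_addmul_2 and sketch_mul_basecase, the 32-bit limb_t, and the exact calling conventions are assumptions made for illustration, and nothing here is scheduled for any particular pipeline. It also omits the overlapped wind-down/feed-in.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* hypothetical 32-bit limb so products fit in uint64_t */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Adds up[0..n-1] * (vp[0] + vp[1]*2^32) to rp[0..n-1], overwrites rp[n]
   with the next limb, and returns the most significant limb. Needs n >= 1.
   Each up[i] is loaded once and used for both multiplier limbs. */
static limb_t sketch_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                              const limb_t *vp)
{
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0, c1 = 0;  /* carries of the v0 row and the v1 row */
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = (uint64_t)up[i] * v0 + rp[i] + c0;
        rp[i] = (limb_t)t0;
        c0 = (limb_t)(t0 >> 32);

        limb_t r1 = (i + 1 < n) ? rp[i + 1] : 0;  /* rp[n] is not read */
        uint64_t t1 = (uint64_t)up[i] * v1 + r1 + c1;
        rp[i + 1] = (limb_t)t1;
        c1 = (limb_t)(t1 >> 32);
    }
    uint64_t t = (uint64_t)rp[n] + c0;  /* fold the v0 row's final carry */
    rp[n] = (limb_t)t;
    return c1 + (limb_t)(t >> 32);
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1], with un >= 1 and vn >= 1. */
static void sketch_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                                const limb_t *vp, size_t vn)
{
    for (size_t i = 0; i < un; i++)
        rp[i] = 0;  /* only the low un limbs are read before being written */

    size_t j = 0;
    for (; j + 2 <= vn; j += 2)  /* two multiplier limbs per outer iteration */
        rp[j + un + 1] = sketch_addmul_2(rp + j, up, un, vp + j);
    if (j < vn)                  /* odd vn: one leftover multiplier limb */
        rp[j + un] = sketch_addmul_1(rp + j, up, un, vp[j]);
}
```

The point of the two-limb inner loop is that it halves the number of passes over rp and up compared with repeated addmul_1 calls. Inlining it into the outer loop, as item 3 suggests, would additionally remove the call overhead and allow the tail of one pass to overlap with the start of the next.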