ASM ?
Jim Langston
Jim.Langston at Sun.COM
Wed Jun 10 03:10:24 CEST 2009
Hi Torbjorn,
Thanks for giving me lots to chew on :-)
One question: you indicate that the SPARC asm was written
from scratch. Would you by chance know by whom?
Jim
//////////////////////
Torbjorn Granlund wrote:
> Jim Langston <Jim.Langston at Sun.COM> writes:
>
> New to the alias - I'm looking into tuning the asm
> routines for SPARC, what is the best approach?
>
> Identify the mpn routines that are most critical, and where you think
> you can improve things compared to the current code.
>
> Consider writing special code for different chips.
>
> How were the routines first developed?
>
> That varies from file to file.
>
> Going through the README file clearly shows that serious effort was
> put into each of the routines. Did they start out as C code that was
> then optimized in assembly?
>
> Very few GMP assembly files started as compiled C. Instead, they were
> written from scratch.
>
> Some general advice:
>
> The most important functions in GMP are typically mpn_mul_basecase and
> mpn_sqr_basecase. If these are not implemented directly in assembly,
> the C code will implement them in terms of mpn_addmul_1 (or, if it
> exists, mpn_addmul_2) and some other functions. For details, please see the C
> code in the mpn/generic directory.
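A minimal sketch of the fallback described above, for illustration only. It is not the code in mpn/generic. It assumes 32-bit limbs so that a plain uint64_t can hold a full limb product, and the names limb_t, sketch_addmul_1 and sketch_mul_basecase are hypothetical stand-ins rather than GMP identifiers.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* hypothetical stand-in for a limb; 32 bits for portability */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb.
   Mirrors the contract of an addmul_1 primitive. */
limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1], schoolbook style.
   rp must not overlap up or vp. */
void sketch_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                         const limb_t *vp, size_t vn)
{
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    /* One addmul_1 pass per limb of vp; the carry-out becomes the
       next (so far untouched) limb of the result. */
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = sketch_addmul_1(rp + j, up, un, vp[j]);
}
```

Nearly all the time is spent in the inner loop of the addmul_1 pass, which is why the speed of that primitive (or of a hand-written basecase routine) dominates multiplication performance.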
>
> Getting GMP to run fast on SPARC hardware isn't going to be easy.
> 64-bit SPARC is the worst instruction set architecture (ISA) for GMP of
> all contemporary architectures. The current code tries to work around
> various shortcomings in the instruction set and implementation. Proper
> GMP speed requires good throughput of forming a 2n-bit product from two
> n-bit operands, where n is the word size. On a 64-bit machine, n must
> be 64 bits, not 32 bits.
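To make the cost concrete, here is a hedged sketch of forming a 128-bit product from two 64-bit words when only 64-bit results are available, as in portable C. The function name sketch_umul_ppmm is hypothetical; the decomposition into four 32x32 partial products is the standard schoolbook one, not a transcription of any GMP source file.

```c
#include <stdint.h>

/* Full 128-bit product of u and v, returned as the pair (hi, lo),
   built from four 32x32 -> 64-bit partial products. */
void sketch_umul_ppmm(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
    uint64_t u0 = u & 0xFFFFFFFFu, u1 = u >> 32;
    uint64_t v0 = v & 0xFFFFFFFFu, v1 = v >> 32;

    uint64_t p00 = u0 * v0;   /* weight 2^0  */
    uint64_t p01 = u0 * v1;   /* weight 2^32 */
    uint64_t p10 = u1 * v0;   /* weight 2^32 */
    uint64_t p11 = u1 * v1;   /* weight 2^64 */

    /* Sum of three values below 2^32, so it cannot overflow. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);

    *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Four multiplies plus a handful of shifts, masks and adds per limb product is the price of lacking a native 64x64 to 128-bit multiply, which is the throughput problem the paragraph above refers to.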
>
> The current mpn/sparc64 code works around the instruction set
> limitations by, e.g. in addmul_2, splitting operands into smaller pieces,
> converting these into floating-point numbers on the fly, then
> multiplying using floating-point instructions, and finally converting the
> fp numbers back to integer values.
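The idea can be sketched in C as follows. This is only an illustration of the principle, not the mpn/sparc64 code: the piece sizes (32-bit pieces of one operand, 16-bit pieces of the other) are an assumption chosen so that every product stays below 2^48 and is therefore exact in an IEEE double, and the names acc_add_shifted and sketch_fp_mul_64x64 are hypothetical.

```c
#include <stdint.h>

/* Add (t << shift) into the 128-bit accumulator hi:lo.
   Requires t < 2^48 and shift <= 80, so nothing is shifted out of 128 bits. */
static void acc_add_shifted(uint64_t *hi, uint64_t *lo, uint64_t t, unsigned shift)
{
    uint64_t add_lo, add_hi;
    if (shift == 0)      { add_lo = t;          add_hi = 0; }
    else if (shift < 64) { add_lo = t << shift; add_hi = t >> (64 - shift); }
    else                 { add_lo = 0;          add_hi = t << (shift - 64); }

    uint64_t old_lo = *lo;
    *lo += add_lo;
    *hi += add_hi + (*lo < old_lo);   /* carry out of the low word */
}

/* 64x64 -> 128-bit product computed with floating-point multiplies. */
void sketch_fp_mul_64x64(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
    *hi = 0;
    *lo = 0;
    for (unsigned i = 0; i < 2; i++) {
        /* 32-bit piece of u, converted to floating point */
        double ui = (double)(uint32_t)(u >> (32 * i));
        for (unsigned j = 0; j < 4; j++) {
            /* 16-bit piece of v, converted to floating point */
            double vj = (double)(uint16_t)(v >> (16 * j));
            /* ui < 2^32 and vj < 2^16, so ui*vj < 2^48 fits exactly
               in a double's 53-bit significand. */
            uint64_t t = (uint64_t)(ui * vj);
            acc_add_shifted(hi, lo, t, 32 * i + 16 * j);
        }
    }
}
```

Eight floating-point multiplies plus the conversions and recombination replace a single integer multiply, which only pays off when the floating-point unit is pipelined and the integer multiplier is not.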
>
> If any Ultrasparc chips had had a pipelined integer multiply unit, we
> could have used that instead of going through all the trouble of
> converting back and forth to floating-point.
>
> The current GMP code utilises Ultrasparc 1-4 reasonably well, but runs
> awfully slowly on the more recent multi-core chips from Sun. Unlike
> Ultrasparc 1-4, these chips' floating-point units are worse than their
> integer units, and need something like 40 cycles for a single
> floating-point operation. I don't know if *any* function unit is
> pipelined in the more recent Sun processors, but I assume one should go
> back to using the integer unit for integer operations there.
>
> The GMP code has not been optimised for Fujitsu's SPARC processors.
> Perhaps these have better integer units than Ultrasparc. I know they
> have floating-point multiply-accumulate, which could be quite useful in
> the absence of useful integer multiply instructions. Theoretically, one
> could reach 4 cycles/cross product in mpn_mul_basecase and
> mpn_sqr_basecase if two multiply-accumulate operations could be
> sustained per cycle.
>
> I noticed that the Solaris assembler has the instructions umulxhi,
> xmulx, and xmulxhi in its tables. These sound a bit promising, if they
> are (well) implemented in some chip. Since mulx returns the low part of
> a 64x64 bit product, perhaps umulxhi returns the most significant 64
> bits?
>
> Good luck with your optimisation project!
>
>
--
/////////////////////////////////////////////
Jim Langston
Sun Microsystems, Inc.
(513) 702-4741 (Cell)
(877) 854-5583 (AccessLine)
AIM: jl9594
jim.langston at sun.com