ASM ?

Wed Jun 10 03:10:24 CEST 2009

Hi Torbjorn,

Thanks for giving me lots to chew on :-)

One question: you indicate that the SPARC asm was written from
scratch. Would you by chance know by whom?

Jim

//////////////////////

Torbjorn Granlund wrote:
> Jim Langston <Jim.Langston at Sun.COM> writes:
>
>   New to the alias - I'm looking into tuning the asm
>   routines for SPARC, what is the best approach?
>
> Identify the mpn routines that are most critical, and where you think
> you can improve things compared to the current code.
>
> Consider writing special code for different chips.
>
>   How were the routines first developed?
>
> That varies from file to file.
>
>   Clearly, going through the README file shows that serious effort was
>   put into each of the routines. Did they start out as C code that was
>   then optimized in assembly?
>
> Very few GMP assembly files started as compiled C.  Instead, they were
> written from scratch.
>
> Some general advice:
>
> The most important functions in GMP are typically mpn_mul_basecase and
> mpn_sqr_basecase.  If these are not implemented directly in assembly,
> the C code will implement them in terms of mpn_addmul_1, or if it exists
> mpn_addmul_2, and some other functions.  For details, please see the C
> code in the mpn/generic directory.
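To make the relationship above concrete, here is a minimal sketch of a
schoolbook multiply built from an addmul_1-style primitive. It is not the
code in mpn/generic: the names sketch_addmul_1 and sketch_mul_basecase and
the 32-bit limb_t are assumptions chosen so that a plain uint64_t can hold
a full limb product in portable C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit limb, so that uint64_t holds a full 2n-bit product. */
typedef uint32_t limb_t;

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this sum cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1]; rp must not overlap inputs. */
static void sketch_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                                const limb_t *vp, size_t vn)
{
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    /* One addmul_1 pass per limb of v; each pass yields one new top limb. */
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = sketch_addmul_1(rp + j, up, un, vp[j]);
}
```

The inner loop is where nearly all the time goes, which is why the
throughput of the n x n -> 2n bit multiply matters so much.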
>
> Getting GMP to run fast on SPARC hardware isn't going to be easy.
> 64-bit SPARC is the worst instruction set architecture (ISA) for GMP of
> all contemporary architectures.  The current code tries to work around
> various shortcomings in the instruction set and implementation.  Proper
> GMP speed requires good throughput of forming a 2n-bit product from two
> n-bit operands, where n is the word size.  On a 64-bit machine, n must
> be 64 bits, not 32 bits.
>
> The current mpn/sparc64 code works around the instruction set
> limitations by, e.g. in addmul_2, splitting operands into smaller
> pieces, converting these into floating-point numbers on the fly, then
> multiplying using floating-point instructions, and finally converting
> the fp numbers back to integer values.
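The idea described above can be sketched in portable C. This only
illustrates the technique and is not the mpn/sparc64 code: the function
name sketch_umul_fp and the choice of 16-bit pieces for both operands are
assumptions. With 16-bit pieces every partial product is below 2^32 and
every column sum is below 2^34, so all double arithmetic stays exact
within the 53-bit significand.

```c
#include <stdint.h>

/* 64x64 -> 128-bit product computed through double-precision multiplies. */
static void sketch_umul_fp(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    double ap[4], bp[4];
    double col[7] = {0.0};

    /* Split each operand into four 16-bit pieces and convert to double. */
    for (int i = 0; i < 4; i++) {
        ap[i] = (double)((a >> (16 * i)) & 0xFFFF);
        bp[i] = (double)((b >> (16 * i)) & 0xFFFF);
    }

    /* Accumulate partial products by column weight 2^(16*(i+j)); exact. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            col[i + j] += ap[i] * bp[j];

    /* Convert back to integers and propagate carries in 16-bit digits. */
    uint64_t digit[8];
    uint64_t carry = 0;
    for (int k = 0; k < 8; k++) {
        uint64_t t = carry + (k < 7 ? (uint64_t)col[k] : 0);
        digit[k] = t & 0xFFFF;
        carry = t >> 16;
    }

    *lo = digit[0] | (digit[1] << 16) | (digit[2] << 32) | (digit[3] << 48);
    *hi = digit[4] | (digit[5] << 16) | (digit[6] << 32) | (digit[7] << 48);
}
```

The conversions in both directions are pure overhead compared with a
native 64x64 multiply, which is the cost the paragraph above refers to.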
>
> If any Ultrasparc chips had had a pipelined integer multiply unit, we
> could have used that instead of going through all the trouble of
> converting back and forth to floating-point.
>
> The current GMP code utilises Ultrasparc 1-4 reasonably well, but runs
> awfully slowly on the more recent multi-core chips from Sun.  Unlike
> Ultrasparc 1-4, these chips' floating-point units are worse than their
> integer units, and need something like 40 cycles for a single
> floating-point operation.  I don't know if *any* function unit is
> pipelined in the more recent Sun processors, but I assume one should go
> back to using the integer unit for integer operations there.
>
> The GMP code has not been optimised for Fujitsu's SPARC processors.
> Perhaps these have better integer units than Ultrasparc.  I know they
> have floating-point multiply-accumulate, which could be quite useful in
> the absence of useful integer multiply instructions.  Theoretically, one
> could reach 4 cycles/cross product in mpn_mul_basecase and
> mpn_sqr_basecase if two multiply-accumulates could be sustained per
> cycle.
>
> I noticed that the Solaris assembler has the instructions umulxhi,
> xmulx, and xmulxhi in its tables.  These sound a bit promising, if they
> are (well) implemented in some chip.  Since mulx returns the low part of
> a 64x64 bit product, perhaps umulxhi returns the most significant 64
> bits?
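If that guess about umulxhi is right, the pair mulx/umulxhi would deliver
the two halves of the 128-bit product directly. The sketch below models
that assumed behaviour in portable C; the function names sketch_mulx and
sketch_umulxhi are hypothetical, and nothing here is taken from an
instruction set manual.

```c
#include <stdint.h>

/* Low 64 bits of a 64x64-bit product, as mulx is described above. */
static uint64_t sketch_mulx(uint64_t a, uint64_t b)
{
    return a * b;  /* unsigned arithmetic wraps modulo 2^64 */
}

/* High 64 bits of the same product: the ASSUMED meaning of umulxhi. */
static uint64_t sketch_umulxhi(uint64_t a, uint64_t b)
{
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* Middle column: at most three 32-bit values, so no overflow. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);

    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

With such a pair, one limb product would take two integer instructions
and no trips through the floating-point unit.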
>
> Good luck with your optimisation project!
>
>   

-- 
/////////////////////////////////////////////

Jim Langston
Sun Microsystems, Inc.

(513) 702-4741 (Cell)
(877) 854-5583 (AccessLine)
AIM: jl9594
jim.langston at sun.com