ASM ?

Sun Jun 7 18:07:25 CEST 2009

Jim Langston <Jim.Langston at Sun.COM> writes:

  New to the alias - I'm looking into tuning the asm
  routines for SPARC, what is the best approach?

Identify the mpn routines that are most critical, and where you think
you can improve things compared to the current code.

Consider writing special code for different chips.

  How were the routines first developed?

That varies from file to file.

  Clearly, going through the README file shows that serious effort was
  put into each of the routines, did it start out as C code, then
  optimize through the assembly ?

Very few GMP assembly files started as compiled C.  Instead, they were
written from scratch.

Some general advice:

The most important functions in GMP are typically mpn_mul_basecase and
mpn_sqr_basecase.  If these are not implemented directly in assembly,
the C code will implement them in terms of mpn_addmul_1, or if it exists
mpn_addmul_2, and some other functions.  For details, please see the C
code in the mpn/generic directory.
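
A minimal sketch of that layering, not the actual mpn/generic source: a
schoolbook basecase multiply built from an addmul_1-style primitive.  The
names sketch_addmul_1 and sketch_mul_basecase and the 32-bit limb_t are
assumptions made for illustration, so that a full limb product fits in a
uint64_t in portable C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit limb, chosen so uint64_t can hold a full product. */
typedef uint32_t limb_t;

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this sum cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1]; rp must not overlap inputs. */
static void sketch_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                                const limb_t *vp, size_t vn)
{
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    /* One addmul_1 pass per limb of v; each pass yields one new top limb. */
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = sketch_addmul_1(rp + j, up, un, vp[j]);
}
```

Nearly all the time goes into the inner loop of the addmul_1 step, which is
why that loop (or a fused basecase) is the natural target for assembly.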

Getting GMP to run fast on SPARC hardware isn't going to be easy.
64-bit SPARC is the worst instruction set architecture (ISA) for GMP of
all contemporary architectures.  The current code tries to work around
various shortcomings in the instruction set and implementation.  Proper
GMP speed requires good throughput when forming a 2n-bit product from two
n-bit operands, where n is the word size.  On a 64-bit machine, n must
be 64, not 32.
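
To make the cost concrete, here is a sketch in portable C of a 64x64 -> 128
bit product built only from multiplies whose result fits in 64 bits.  The
helper name sketch_mul_wide is hypothetical; the point is just that four
partial products plus carry handling stand in for what a single widening
multiply would deliver.

```c
#include <stdint.h>

/* Hypothetical helper: (*hi:*lo) = u * v, using four 32x32->64 products. */
static void sketch_mul_wide(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
    uint64_t ul = u & 0xFFFFFFFFu, uh = u >> 32;
    uint64_t vl = v & 0xFFFFFFFFu, vh = v >> 32;

    uint64_t ll = ul * vl;   /* weight 2^0  */
    uint64_t lh = ul * vh;   /* weight 2^32 */
    uint64_t hl = uh * vl;   /* weight 2^32 */
    uint64_t hh = uh * vh;   /* weight 2^64 */

    /* Sum of everything landing in bits 32..63; at most 3*(2^32-1). */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    *lo = (mid << 32) | (ll & 0xFFFFFFFFu);
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```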

The current mpn/sparc64 code works around the instruction set
limitations by, e.g. in addmul_2, splitting operands into smaller pieces,
converting these into floating-point numbers on the fly, multiplying
them using floating-point instructions, and finally converting the
fp numbers back to integer values.
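
The idea can be sketched in C as follows.  This is an illustration of the
technique only, not a transcription of the mpn/sparc64 code: the helper
name sketch_fp_mul_64x32 and the choice of 16-bit pieces against a 32-bit
multiplier are assumptions, picked so that every partial product is below
2^48 and therefore exact in an IEEE double's 53-bit significand.

```c
#include <stdint.h>

/* Hypothetical helper: (*hi:*lo) = u * v, with the multiplies done in
   double precision on 16-bit pieces of u. */
static void sketch_fp_mul_64x32(uint64_t *hi, uint64_t *lo,
                                uint64_t u, uint32_t v)
{
    double vd = (double)v;              /* exact: 32 bits fit in 53 */
    uint64_t low = 0, high = 0;

    for (unsigned k = 0; k < 4; k++) {
        unsigned shift = 16 * k;
        double piece = (double)((u >> shift) & 0xFFFF);  /* exact */
        uint64_t p = (uint64_t)(piece * vd);             /* < 2^48, exact */

        /* Add p * 2^shift into the 128-bit accumulator (high:low). */
        uint64_t add_lo = p << shift;
        uint64_t add_hi = shift ? p >> (64 - shift) : 0;
        low += add_lo;
        high += add_hi + (low < add_lo);   /* carry out of the low word */
    }
    *hi = high;
    *lo = low;
}
```

The integer/floating-point conversions and the recombination of the pieces
are pure overhead, which is why this only pays off when the floating-point
multiplier is pipelined and the integer multiplier is not.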

If any Ultrasparc chip had had a pipelined integer multiply unit, we
could have used that instead of going through all the trouble of
converting back and forth to floating-point.

The current GMP code utilises Ultrasparc 1-4 reasonably well, but runs
awfully slowly on the more recent multi-core chips from Sun.  Unlike
Ultrasparc 1-4, these chips' floating-point units are worse than their
integer units, and need something like 40 cycles for a single
floating-point operation.  I don't know if *any* function unit is
pipelined in the more recent Sun processors, but I assume one should go
back to using the integer unit for integer operations there.

The GMP code has not been optimised for Fujitsu's SPARC processors.
Perhaps these have better integer units than Ultrasparc.  I know they
have floating-point multiply-accumulate, which could be quite useful in
the absence of useful integer multiply instructions.  Theoretically, one
could reach 4 cycles/cross product in mpn_mul_basecase and
mpn_sqr_basecase if two multiply-accumulates could be sustained per
cycle.

I noticed that the Solaris assembler has the instructions umulxhi,
xmulx, and xmulxhi in its tables.  These sound a bit promising, if they
are (well) implemented in some chip.  Since mulx returns the low part of
a 64x64 bit product, perhaps umulxhi returns the most significant 64
bits?

Good luck with your optimisation project!

-- 
Torbjörn