proposal for enhancement to configure, re UltraSPARC-T1

Torbjorn Granlund tg-this-will-bounce-but-I-am-subscribed-to-the-list-honest at swox.com
Mon Dec 11 12:17:23 CET 2006


nisse at lysator.liu.se (Niels Möller) writes:

  > Since usage of the floating point registers is not recommended on
  > UltraSPARC-T1 based systems,
  
  Why is it not recommended? Are floating-point operations in general
  very slow on this processor?
  
"Very slow" is an understatement.

Two cores share one FPU.  That shared FPU is not pipelined, and one
operation takes more than 30 cycles IIRC.

Furthermore, since the FPU is shared, when one core uses the FPU, the
other core stalls.

A curious design.  Sun has always been the market leader for slow
CPUs, and that lead was further extended with these new CPUs.

The current GMPbench score is something like 300.  With the FPU stuff
removed from GMP, it gets radically boosted to much less than 1000.

Athlon64 has about 8000.

Hmm, Scalable Processor ARChitecture.  Scale down by 10x.  :-)

  > Using the generic integer functions in GMP yields about five times
  > better performance than using the .asm functions on UltraSPARC-T1
  > based systems (i.e. an improvement of about 400%).
  
  When you say "generic integer functions", does that boil down to using
  the sparc v9 instruction mulx? (Which takes two 64 bit integers and
  produces the least significant 64 bits of their product).
  
Yes, the lame mulx is the only alternative on the SPARC v9.

To make things worse, it is not properly implemented on any Sun SPARC
chips.  The US 1 and US 2 stall the entire CPU for 36 cycles.  The US
3 and US 4 don't stall the CPU, but the multiply unit is capable of
dealing with just one mulx at a time.  And its latency is something
like 6 cycles.  Ever heard of "pipelining"?

  The problem with mulx is that when you need *all* bits of the product
  (which you almost always do in GMP), you have to restrict the
  arguments to 32 bits. To get a full 128 bit product, you have to use
  mulx four times, shifting around the different halves of the inputs
  (or three mulx, theoretically, but with even more overhead).
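
The four-multiply scheme described above can be sketched in portable C
roughly as follows.  This is only an illustration, not GMP's actual
code: the function name mul_64x64_128 is made up here, and each
32x32 -> 64 bit product stands in for one mulx with its arguments
restricted to 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper (name invented for illustration): full
   64x64 -> 128 bit product from four 32x32 -> 64 bit multiplies. */
static void mul_64x64_128(uint64_t u, uint64_t v,
                          uint64_t *hi, uint64_t *lo)
{
    /* Split both inputs into 32-bit halves. */
    uint64_t ul = u & 0xffffffffu, uh = u >> 32;
    uint64_t vl = v & 0xffffffffu, vh = v >> 32;

    /* Four partial products; none can overflow 64 bits. */
    uint64_t ll = ul * vl;
    uint64_t lh = ul * vh;
    uint64_t hl = uh * vl;
    uint64_t hh = uh * vh;

    /* Middle column: hl plus two values below 2^32 fits in 64 bits,
       since hl <= (2^32 - 1)^2. */
    uint64_t mid = hl + (ll >> 32) + (lh & 0xffffffffu);

    *lo = (mid << 32) | (ll & 0xffffffffu);
    *hi = hh + (lh >> 32) + (mid >> 32);
}
```

The shifting and masking around the four multiplies is the overhead
the paragraph refers to.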
  
  As far as I understand the GMP sparc issues, that's the reason for the
  somewhat complicated floating-point using code; it was the only way to
  get a serious performance gain over the older 32-bit code. This is in
  contrast to other architectures, where you typically get a fourfold
  speedup from using the more powerful 64-bit integer multiplication.
  
  Have you tried any benchmarks comparing 32-bit and 64-bit bignum
  performance on the T1, and on other ultrasparcs?
  
  Does the T1, or other recent ultrasparcs, support any other
  instructions that can do a full multiplication (two 64 bit inputs, one
  128 bit output) with reasonable performance?
  
If one does "strings" on the Slowaris assembler, something like
mulxuhi shows up, so there is hope.  Or maybe not.

In the meantime, Sun is willing to sell you a very expensive PCI card
for RSA.  It costs you like a handful of PCs, and performs like half
of one.

The problems with the SPARC v9 ISA are not limited to the lack of
proper integer multiply support.  There is also no reasonably
efficient way of computing the carry from a 64-bit add or the borrow
from a 64-bit subtract.  The carry flag exists, but is set from the
carry in the middle of the word!
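
Without a usable 64-bit carry flag, the carry or borrow has to be
recomputed with compares.  A minimal sketch of that generic workaround
in C, assuming nothing about what GMP's SPARC code actually does (the
names add_with_carry and sub_with_borrow are invented for this
example):

```c
#include <stdint.h>

/* Hypothetical helper: r = a + b + cin, with the carry-out recovered
   by unsigned compares instead of a hardware flag.  cin must be 0 or 1. */
static uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t cin,
                               uint64_t *cout)
{
    uint64_t s  = a + b;
    uint64_t c1 = s < a;        /* wrapped around in a + b */
    uint64_t r  = s + cin;
    uint64_t c2 = r < s;        /* wrapped around when adding cin */
    *cout = c1 | c2;            /* at most one of the two can be set */
    return r;
}

/* Hypothetical helper: r = a - b - bin, with the borrow-out recovered
   the same way.  bin must be 0 or 1. */
static uint64_t sub_with_borrow(uint64_t a, uint64_t b, uint64_t bin,
                                uint64_t *bout)
{
    uint64_t d  = a - b;
    uint64_t b1 = a < b;        /* borrow from a - b */
    uint64_t r  = d - bin;
    uint64_t b2 = d < bin;      /* borrow when subtracting bin */
    *bout = b1 | b2;
    return r;
}
```

Each limb of a multi-limb add thus costs two adds, two compares and an
or, where an ISA with a proper carry flag needs a single add-with-carry.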

I wouldn't describe writing code for SPARC v9 as "programming", but as
"working around deficiencies".

The RISCs were supposed to give us cleaner, more usable instruction
sets than x86, weren't they?  :-)

PS. In all fairness, the PowerPC and Alpha ISAs are far superior.  RIP.

-- 
Torbjörn

