GMPBench for Sun Fire T2000

Tue Apr 25 21:29:04 CEST 2006

Timothy Jacobson <Timothy.Jacobson at Sun.COM> writes:

  As a Sun Engineer, this doesn't surprise me.  The architecture is very
  different for this kind of server than a single cpu desktop machine.
  Machines like the Sun Fire T2000 are intended for throughput with many
  threads running.  There is a trade off for having fast multi threaded
  processes running parallel vs. performance for a single cpu process.
  They were designed to be web hosting and data base machines, not
  scientific computing engines.  When all 32 virtual processors are in
  motion, this outperforms any single cpu desktop you can find.

Granted these processors are not intended for scientific computing,
but isn't more than 100x worse performance than an Opteron a bit
embarrassing?

  The other problem is that the performance in the GMP code is mostly gnu
  specific inline code for x86 processors.

Pardon?

That has never been true.  Until GMP 2.0, released 10 years ago, GMP
relied on inline asm, but it was by no means x86 specific.

  I have worked on a patch that
  produces fantastic numbers for Sun Opteron machines.

Wow, this is impressive!  I imagine "fantastic" here is relative to
the T1 performance?  :-)

But to be serious, are you trying to say that GMP 4.2 doesn't work for
Opteron machines running Solaris, or are you implying that the
performance is bad?  (If performance is "bad", which adjective is
appropriate for the 150 times slower T1 machines?)

  I have given some
  thought to writing a threaded version of the GMP library.  That is not
  an easy task.  Maybe not possible.

I'd say to forget it.

It will give little or no speedup for sensibly large numbers.  Only
for enormous numbers is there any chance of seeing useful speedup.

But starting with a handicap of 150x, I would suggest that no speedup
is enough to make the resulting performance truly exciting.

  By the way, thanks for running the benchmark.  At Sun, we welcome the
  input.  We are always looking for ways to improve our products and I
  will pass this information on to other engineers to look at.

Want input?

Here are some suggestions:

Actually implement the lurking umulxhi instruction (which is output by
"strings /usr/ccs/bin/as | grep mulx") for generating the upper part
of a 64 bit unsigned multiply.
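
To illustrate what is lost without such an instruction, here is a
minimal C sketch of computing the upper 64 bits of a 64x64-bit product
from 32-bit halves.  The function name umul_hi64 is hypothetical and
this is not GMP's actual code; it only shows the kind of work software
has to do when the hardware provides just the low half of the product.

```c
#include <stdint.h>

/* Upper 64 bits of the 128-bit product u * v, assembled from four
   32x32->64 partial products.  Hypothetical helper, for illustration. */
static uint64_t umul_hi64(uint64_t u, uint64_t v)
{
    uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;   /* low/high halves of u */
    uint64_t v0 = v & 0xffffffffu, v1 = v >> 32;   /* low/high halves of v */

    uint64_t p00 = u0 * v0;   /* weight 2^0  */
    uint64_t p01 = u0 * v1;   /* weight 2^32 */
    uint64_t p10 = u1 * v0;   /* weight 2^32 */
    uint64_t p11 = u1 * v1;   /* weight 2^64 */

    /* Sum of the column at weight 2^32.  Each term is below 2^32, so
       the sum cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    /* Everything at weight 2^64 and above forms the high word. */
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

That is four multiplies plus a handful of shifts, masks and adds for
what a single high-multiply instruction would deliver directly.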

Add instructions for 64-bit add-with-carry (and subtract-with-borrow).
Current add-with-carry instructions set carry from bit 31, not from
bit 63.  If that is too complex, implement the existing conditional
moves so that we can sustain one per cycle.
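
As a rough sketch of why this matters, here is multi-limb addition
written in portable C, where the carry out of bit 63 has to be
recovered with comparisons because no 64-bit add-with-carry is
assumed.  The name add_limbs_sketch is hypothetical; this is an
illustration, not GMP's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1], least significant limb first.
   Returns the carry out of the most significant limb (0 or 1).
   Hypothetical helper, for illustration only. */
static uint64_t add_limbs_sketch(uint64_t *rp, const uint64_t *ap,
                                 const uint64_t *bp, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t a  = ap[i];
        uint64_t s  = a + bp[i];   /* may wrap around modulo 2^64 */
        uint64_t c1 = s < a;       /* 1 if the first add wrapped */
        uint64_t r  = s + cy;      /* add the incoming carry */
        uint64_t c2 = r < s;       /* 1 if the second add wrapped */
        rp[i] = r;
        cy = c1 | c2;              /* at most one of the two can be set */
    }
    return cy;
}
```

Each limb costs two adds and two compares (or conditional moves)
instead of a single add-with-carry, which is why the throughput of
those operations becomes the bottleneck.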

Make sure mulx and umulxhi are fully pipelined.  The stalling
UltraSPARC 1 and 2 and the semi-stalling UltraSPARC 3 are not good
enough.  Most competitors have had pipelined integer multipliers for
many years.
(The only real exception here is IBM's 64-bit powerpc processors, but
they give decent performance anyway.)

Then use OoO execution, do proper register renaming of both integer
registers and the carry flag register, and last but not least ramp
clock frequency to at least 3 GHz.

Since I am polite (which should be obvious from the reply above) I
will not publish the UltraSPARC T1 gmpbench scores at
http://swox.com/gmp/gmpbench.html.  :-)

-- 
Torbjörn