[PATCH] Optimize 32-bit sparc T1 multiply routines.

Torbjorn Granlund tg at gmplib.org
Fri Jan 4 14:54:15 CET 2013

I took a brief look at the definition of these instructions.  It is
clear that they did not consult an expert in the area.  They also added
DES instructions now (in 2012).

They added a few useful instructions, addxc/addxcc and umulxhi.  The
former is a 64-bit addition with a useful carry in and carry out
(earlier 64-bit add instructions read a carry bit that a previous add
had set from bit 31, the middle of a 64-bit result...).

They forgot to add corresponding subxc/subxcc instructions, creating a
very unfortunate timing asymmetry.

  The tradeoff of when mpmul is faster than a flat-out mulx/umulxhi loop
  is beyond 2x2 limbs, so I don't see any value in looking into that
  just yet.
Did you add umulxhi use in your patch from a few days ago?  I've seen
that instruction in the Solaris assembler for > 10 years, but I had
understood that no chip implemented it.  Has that now changed?

Is that instruction well-implemented on any chip?  What throughput
does a series of independent umulxhi instructions achieve?  Unless the
pipeline is poorly designed, one can usually write a mul_basecase (or
even, say, an addmul_2) which runs at about the mulx/umulxhi
bandwidth.  But that requires careful software pipelining.

For example, how well does something like this run?

        .register       %g2, #scratch
        .register       %g3, #scratch
.globl main
        .type   main, #function
        .align  32
main:   save    %sp, -176, %sp
        sethi   %hi(1500000000), %i0

1:      umulxhi %g5, %g5, %g1
        mulx    %g5, %g5, %g2
        umulxhi %g5, %g5, %g3
        mulx    %g5, %g5, %g4
        umulxhi %g5, %g5, %i1
        mulx    %g5, %g5, %i2
        umulxhi %g5, %g5, %i3
        mulx    %g5, %g5, %i4
        brnz,a  %i0, 1b
        add     %i0, -1, %i0

        ret
        restore %g0, 0, %o0


Please replace "1500000000" in the file with the actual CPU frequency
(in Hz).  Ideally, this would then run in 4 seconds, but 8 seconds is
not bad either.  If it is slower, it will not compete with x86-64 CPUs.

But what counts when considering "mpmul" is relative performance, of
course.  If mpmul needs many cycles to accumulate a 128-bit product,
then a mulx/umulxhi pair will surely also be slow, since odds are that
they use the same multiply hardware.

  There's a lot of setup and teardown associated with using mpmul
  because it uses several register windows and some of the floating
  point registers to hold the entire set of inputs, and to provide the

A really silly design, since it forces a non-balanced instruction mix.
Few pipelines mind some loads and stores between arithmetic, but with
these instructions one will delay the start of arithmetic for tens of
cycles while performing lots of loads.  (For modexp, I assume one can
stay in registers, making this overhead small when using a large
exponent, such as RSA signing/decryption.)

  That's why realistically I'll probably only use mpmul for 3x3 and
I wouldn't be surprised if mpmul would never beat truly well-optimised
discrete code.

