[PATCH] Optimize 32-bit sparc T1 multiply routines.

Sun Jan 6 01:25:10 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  You could always do the two's complement of one of the operands on the
  fly, and then use the same add with carry instructions as in add_n.

  I'm thinking aloud, so I'm sorry if I get this wrong, but I think it's
  best to handle the unlikely case of low zero limbs up front. Then it's a
  plain negate of the first non-zero limb, and a plain complement for the
  remaining limbs; the important thing here is that the negation generates
  no additional carries to propagate.

  So compared to add_n, you just get an additional xor with -1 in the loop
  (and not on the loop's critical path). I can't guess whether or not that
  will be visible in the execution time.
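
As a rough illustration of the idea quoted above, here is a minimal C
sketch of a subtraction done as a + (-b), with the negation of b formed
on the fly. It is not GMP code: the names limb_t and sub_n_via_negate
are hypothetical, and the 64-bit limb width is an assumption. Low zero
limbs of b are handled up front, the first non-zero limb is negated, and
the remaining limbs are complemented.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a limb type */

/* r = a - b over n limbs, least significant limb first.
   Returns the borrow out (0 or 1).  Hypothetical helper, not GMP API. */
limb_t sub_n_via_negate(limb_t *r, const limb_t *a, const limb_t *b, size_t n)
{
    size_t i = 0;

    /* Low zero limbs of b: the matching limbs of -b are zero too. */
    while (i < n && b[i] == 0) {
        r[i] = a[i];
        i++;
    }
    if (i == n)
        return 0;               /* b == 0, so no borrow */

    /* First non-zero limb: plain negate, no carry into it. */
    limb_t nb = -b[i];          /* unsigned negation, modulo 2^64 */
    limb_t s = a[i] + nb;
    limb_t cy = s < nb;         /* carry out of this limb */
    r[i] = s;
    i++;

    /* Remaining limbs: plain complement, then add with carry. */
    for (; i < n; i++) {
        limb_t cb = ~b[i];
        limb_t t = a[i] + cb;
        limb_t c1 = t < cb;
        limb_t u = t + cy;
        limb_t c2 = u < t;
        r[i] = u;
        cy = c1 | c2;
    }

    /* With b != 0, a + (-b) carries out exactly when a >= b. */
    return cy ^ 1;
}
```

For example, with a = {0, 5} and b = {1, 0} the sketch stores
{2^64 - 1, 4} and returns borrow 0.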

For sub_n, I suppose

    ldx
    ldx
    xnor (with %g0)
    addxcc
    stx

would be the right mix.  This should run in 2.5 + epsilon c/l (cycles
per limb), if properly software pipelined.  4x unrolling should give
2.75 c/l, unless the hardware inserts pipeline bubbles for taken
branches.
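
The instruction mix above can be modelled in C roughly as follows. This
is only a sketch: limb_t and sub_n_via_complement are hypothetical
names, the 64-bit limb width is an assumption, and starting the carry at
1 is how this sketch supplies the "+1" of the two's complement (the
message does not say how the assembly would do that).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a limb type */

/* r = a - b over n limbs, computed as a + ~b + 1.
   Returns the borrow out (0 or 1).  Hypothetical helper, not GMP API. */
limb_t sub_n_via_complement(limb_t *r, const limb_t *a, const limb_t *b,
                            size_t n)
{
    limb_t cy = 1;              /* assumed initial carry: the "+1" */

    for (size_t i = 0; i < n; i++) {
        limb_t cb = ~b[i];      /* plays the role of the xnor */
        limb_t t = a[i] + cb;   /* add with carry, written out as */
        limb_t c1 = t < cb;     /* two adds and two carry checks  */
        limb_t u = t + cy;
        limb_t c2 = u < t;
        r[i] = u;               /* the store */
        cy = c1 | c2;           /* at most one of c1, c2 is set */
    }

    return cy ^ 1;              /* carry out means no borrow */
}
```

Unlike the negate-based variant, this form needs no special case for low
zero limbs, since the initial carry simply ripples through them.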

For add_n, which needs no xnor, things should run 0.5 c/l faster.

(I am assuming it is a 2-way pipeline.)
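
One way to reproduce the figures above is the small C sketch below. It
assumes perfect dual issue and two loop-overhead instructions per
iteration (say, a counter update and a branch); that overhead count is a
guess made for this sketch, not something stated in the message, and
est_cycles_per_limb is a hypothetical helper.

```c
/* Estimated cycles per limb for a loop that issues insns_per_limb
   instructions per limb plus overhead_insns per iteration, unrolled
   `unroll` times, on a `width`-way pipeline with no stalls.
   Hypothetical model, not a measurement. */
double est_cycles_per_limb(int insns_per_limb, int overhead_insns,
                           int unroll, int width)
{
    double insns_per_iter = (double)insns_per_limb * unroll + overhead_insns;
    return insns_per_iter / ((double)unroll * width);
}
```

Under those assumptions, est_cycles_per_limb(5, 2, 4, 2) gives 2.75 for
sub_n and est_cycles_per_limb(4, 2, 4, 2) gives 2.25 for add_n, which is
the 0.5 c/l difference mentioned above.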

-- 
Torbjörn