[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 15:17:11 CET 2013

bodrato at mail.dm.unipi.it writes:

  Does the immediate means that, to write e.g. sqr_basecase (it should be
  far simpler than writing mul_basecase), you need a branch for each
  different N?

Since you have to preload a (weird) set of hardwired registers, one will
really need special code for every N.

I missed the cycle table.  Here I've added data for an AMD K10, the
current fastest CPU for mul_basecase:

1         79            21
2         84            32
3         90            53
4         98            70
5         108           95
6         120          126
7         134          160
8         150          208
9         168          260
10        188          303
11        210          357
12        234          411
13        260          490
14        288          543
15        318          617
16        350          686

These are very good numbers for the new SPARC chip, except that the
overhead is terrible, as expected from the design.  IIUC, you did not
account for the initial loads and final stores, which we will need.  I
expect them to add 3n/2 to 3n cycles, depending on the pipeline
characteristics.

The Oracle manuals recommend that one loops around mpmul checking for
intermittent hardware failure, whatever that means.

  (Almost) exactly 78 + N + N*N, a big latency plus a quadratic algorithm.
  When 2n > 26, we have (78 + 2n + 2n*2n) > 3*(78 + n + n*n)... I'm curious
  to see where the Karatsuba threshold will be.

Our current Karatsuba code (evaluating in 0, -1, oo) will suffer from
the forgotten subtraction instructions.  Evaluating in 0, +1, oo might
be better...

-- 
Torbjörn