[PATCH 3 of 3] Add MIPS32R1 MADDU-based *mul_1.asm functions

Fri Sep 13 12:41:02 UTC 2019

T> What is MDU?
T>   It is faster on all tried MIPS32R1/R2/R5 CPUs (see the c/l table) and is
T>   expected to be fast with any pipelined MDU. So-called Area-Efficient MDU
T>   (optional on some MCUs) will run it *much* slower (~3x for addmul_1).
T> What is 3x slower than what?

MDU is the multiply-divide unit.

they (MTI, then IMG, now Wave Computing) have [at least] four MDU kinds in their portfolio:

1) "area-efficient", the only non-pipelined option: about 30 cycles per multiply (or multiply-add; the distinction does not matter here or below), with no early exit;
2) 32x16, can issue one 32x16 multiply-add every cycle or one 32x32 op every second cycle;
3) 32x32 high-performance non-DSP;
4) 32x32 DSP.

3 and 4 perform the same, both per the specs and as the 24Kc (no DSP ASE) and 24KEc (implements DSP ASE) results show.

naturally, three non-pipelined multiply ops per limb (roughly 3 x 30 cycles) will be slower than the single multiply per limb (roughly 30 cycles) of the MIPS-II code on cores with MDU 1; that is where the ~3x comes from.

T> I took a quick look at the code.  Do you use madd/msub for accumulation
T> here, while actual multiplication is done by multu?  As MIPS lacks

not quite so.

as MDU 2+ is pipelined, the "multu $xx, 1" idiom (multiply by a register holding 1) is used to quickly reset the accumulator:
this is clearly shorter and faster than an mthi/mtlo pair on in-order cores and *absolutely* critical on P5600 (and likely 74K* with a similar pipeline).

but the choice of maddu/multu depends on the subroutine because the order matters; addmul_1 just happens to be more flexible than submul_1 in this regard.
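
a rough C model of the reset idiom may help here. it is only a sketch, not code from the patch: the `hilo` type and all helper names are made up for illustration, and HI:LO is treated as one 64-bit accumulator. multu overwrites the accumulator and maddu adds to it, so multiplying by a register that holds 1 loads the accumulator with a single MDU op, where mthi/mtlo need two instructions.

```c
#include <stdint.h>

/* Hypothetical model: HI:LO as a single 64-bit accumulator. */
typedef struct { uint64_t acc; } hilo;

/* multu rs, rt: HI:LO = rs * rt (unsigned), overwriting the old value. */
static inline void multu(hilo *m, uint32_t rs, uint32_t rt)
{
    m->acc = (uint64_t)rs * rt;
}

/* maddu rs, rt: HI:LO += rs * rt (unsigned). */
static inline void maddu(hilo *m, uint32_t rs, uint32_t rt)
{
    m->acc += (uint64_t)rs * rt;
}

/* mthi rs / mtlo rs: write one half of the accumulator, keep the other. */
static inline void mthi(hilo *m, uint32_t rs)
{
    m->acc = (m->acc & 0xffffffffu) | ((uint64_t)rs << 32);
}

static inline void mtlo(hilo *m, uint32_t rs)
{
    m->acc = (m->acc & ~(uint64_t)0xffffffffu) | rs;
}

/* mfhi / mflo: read the upper and lower 32 bits. */
static inline uint32_t mfhi(const hilo *m) { return (uint32_t)(m->acc >> 32); }
static inline uint32_t mflo(const hilo *m) { return (uint32_t)m->acc; }

/* Reset with one MDU op: "multu x, one", where the register `one` holds 1. */
static inline void reset_multu(hilo *m, uint32_t x)
{
    multu(m, x, 1);
}

/* Reset with the two-instruction pair. */
static inline void reset_mthi_mtlo(hilo *m, uint32_t x)
{
    mthi(m, 0);
    mtlo(m, x);
}
```

both reset variants leave the same value in the model; the difference discussed above is purely in instruction count and pipeline behaviour, which this model does not capture.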

T> It's long since I did any substantial work with MIPS, but it would
T> appear that, at least for addmul_1, madd could be used also for
T> multiplication.  One should of course avoid creating a slow recurrency
T> path.

it is all about order: wrong placement of the carry accumulation drops P5600 performance to 14 c/l, so the carry is always accumulated at the end of the sequence.

doing "multu *up, vl; maddu *rp, 1" is slightly faster than "multu *rp, 1; maddu *up, vl" on P5600, especially for N=1,2.
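
to make the ordering concrete, here is a C sketch of one way to read the per-limb addmul_1 sequence: three MDU ops per limb, with the carry added last. this is a model under assumptions, not the code from the patch: `model_addmul_1` is a made-up name, HI:LO is a plain uint64_t, and the real loop is presumably scheduled and unrolled differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] += up[0..n-1] * vl; returns the carry-out limb.
   Hypothetical model with 32-bit limbs, not the patch code. */
static uint32_t model_addmul_1(uint32_t *rp, const uint32_t *up,
                               size_t n, uint32_t vl)
{
    uint32_t carry = 0;

    assert(n >= 1);   /* GMP's mpn routines expect at least one limb */

    for (size_t i = 0; i < n; i++) {
        uint64_t acc;

        acc  = (uint64_t)up[i] * vl;    /* multu *up, vl  : overwrites HI:LO */
        acc += (uint64_t)rp[i] * 1;     /* maddu *rp, 1   : add the old limb */
        acc += (uint64_t)carry * 1;     /* maddu carry, 1 : carry goes last  */

        rp[i] = (uint32_t)acc;          /* mflo */
        carry = (uint32_t)(acc >> 32);  /* mfhi */
    }
    return carry;
}
```

the accumulator cannot overflow: the largest value per limb is (2^32-1)^2 + 2*(2^32-1) = 2^64-1. swapping the first two ops ("multu *rp, 1; maddu *up, vl") gives the same arithmetic result, so the choice between them is a scheduling question only.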