[PATCH 3 of 3] Add MIPS32R1 MADDU-based *mul_1.asm functions
info at mobile-stream.com
Fri Sep 13 12:41:02 UTC 2019
T> What is MDU?
T> It is faster on all tried MIPS32R1/R2/R5 CPUs (see the c/l table) and is
T> expected to be fast with any pipelined MDU. So-called Area-Efficient MDU
T> (optional on some MCUs) will run it *much* slower (~3x for addmul_1).
T> What is 3x slower than what?
MDU is the multiply-divide unit.
they (MTI, then IMG, now Wave Computing) have [at least] four MDU kinds in their portfolio:
1) the "area-efficient" option, the only non-pipelined one: about 30 cycles per multiply with no early exit (the same holds for multiply-add, here and below);
2) 32x16, can issue 32x16 multiply-add every cycle or one 32x32 op every second cycle;
3) 32x32 high-performance non-DSP;
4) 32x32 DSP.
kinds 3 and 4 perform the same according to the specs, as is also evident from the 24Kc (no DSP ASE) and 24KEc (implements DSP ASE) results.
naturally, on cores with MDU 1, three non-pipelined multiply ops per limb will be slower than the single multiply per limb in the MIPS-II code.
T> I took a quick look at the code. Do you use madd/msub for accumulation
T> here, while actual multiplication is done by multu? As MIPS lacks
not quite so.
as MDU 2+ is pipelined, the "multu $xx, 1" idiom is used to quickly reset the accumulator.
this is clearly shorter and faster than an mthi/mtlo pair on in-order cores and *absolutely* critical on P5600 (and likely on 74K* with a similar pipeline).
but the choice between maddu and multu depends on the subroutine, because the order matters; addmul_1 just happens to be more flexible than submul_1 in this respect.
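
to make the reset idiom concrete, here is a minimal C sketch that models the HI/LO pair as a 64-bit accumulator. the names hilo_t, reset_with_multu and reset_with_mthi_mtlo are made up for this illustration and are not part of the patch; the sketch only shows that "multu x, 1" leaves the same accumulator state as an mthi/mtlo pair and says nothing about timing.

```c
#include <stdint.h>

/* Model of the MIPS32 HI/LO register pair (hypothetical type name). */
typedef struct { uint32_t hi, lo; } hilo_t;

/* multu rs, rt: HI:LO = rs * rt (unsigned); old contents are overwritten. */
static inline void multu(hilo_t *acc, uint32_t rs, uint32_t rt)
{
    uint64_t p = (uint64_t)rs * rt;
    acc->hi = (uint32_t)(p >> 32);
    acc->lo = (uint32_t)p;
}

/* maddu rs, rt: HI:LO += rs * rt (unsigned, wraps modulo 2^64). */
static inline void maddu(hilo_t *acc, uint32_t rs, uint32_t rt)
{
    uint64_t a = ((uint64_t)acc->hi << 32) | acc->lo;
    a += (uint64_t)rs * rt;
    acc->hi = (uint32_t)(a >> 32);
    acc->lo = (uint32_t)a;
}

/* mthi / mtlo: write one half of the accumulator each. */
static inline void mthi(hilo_t *acc, uint32_t rs) { acc->hi = rs; }
static inline void mtlo(hilo_t *acc, uint32_t rs) { acc->lo = rs; }

/* One instruction: "multu x, 1" sets HI:LO = x regardless of stale contents. */
static inline hilo_t reset_with_multu(uint32_t x)
{
    hilo_t acc = { 0xdeadbeefu, 0xdeadbeefu };  /* stale contents */
    multu(&acc, x, 1);
    return acc;
}

/* Two instructions: the mthi/mtlo pair reaches the same state. */
static inline hilo_t reset_with_mthi_mtlo(uint32_t x)
{
    hilo_t acc = { 0xdeadbeefu, 0xdeadbeefu };  /* stale contents */
    mthi(&acc, 0);
    mtlo(&acc, x);
    return acc;
}
```

after either reset, a following maddu accumulates on top of x, which is what the mul_1-style loops rely on.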
T> It's long since I did any substantial work with MIPS, but it would
T> appear that, at least for addmul_1, madd could be used also for
T> multiplication. One should of course avoid creating a slow recurrency
T> path.
it is all about order: wrong placement of the carry accumulation drops P5600 performance to 14 c/l, so it is always at the end of the sequence.
doing "multu *up, vl; maddu *rp, 1" is slightly faster than "multu *rp, 1; maddu *up, vl" on P5600, especially for N=1,2.
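
as a sanity check on the two orderings, here is a small C model of an addmul_1-style loop over 32-bit limbs. it is a sketch under assumptions, not the code from the patch: the function name addmul_1_model and the product_first flag are invented here, and the third step (adding the carry as a multiply-add by 1) is my reading of "three multiply ops per limb" with the carry accumulated last. it shows only that both orderings compute the same result; the speed difference is a pipeline effect that C cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;

/* rp[0..n-1] += up[0..n-1] * vl; returns the carry-out limb.
   product_first selects which operand starts the accumulator. */
limb_t addmul_1_model(limb_t *rp, const limb_t *up, size_t n,
                      limb_t vl, int product_first)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t acc;                     /* models HI:LO */
        if (product_first) {
            acc  = (uint64_t)up[i] * vl;  /* multu *up, vl */
            acc += (uint64_t)rp[i] * 1;   /* maddu *rp, 1  */
        } else {
            acc  = (uint64_t)rp[i] * 1;   /* multu *rp, 1  */
            acc += (uint64_t)up[i] * vl;  /* maddu *up, vl */
        }
        acc += (uint64_t)cy * 1;          /* carry added last (assumed maddu cy, 1) */
        rp[i] = (limb_t)acc;              /* mflo */
        cy = (limb_t)(acc >> 32);         /* mfhi */
    }
    return cy;
}
```

the accumulator cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so one product plus two limb-sized addends always fits in HI:LO.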