Retrocomputing fun: Improve m68000 umul_ppmm
David Sparks
sparks05 at proton.me
Tue Feb 10 15:37:22 CET 2026
I was looking through the silicon archaeology in longlong.h (i860! Fairchild Clipper!)
when I noticed some low-hanging fruit in a place I didn't expect.
There are two separate obvious improvements that can be made to the
68000-without-MULU.L code path.
The first is to the summation of partial products. Summing
the middle products (ah*bl + al*bh) can produce a carry, but
there's no need for a conditional branch and large immediate to
handle it.
Old code (unchanged context in parens):
(addx.l %3,%2)
jcc 1f
add.l %#0x10000,%0
1: move.l %2,%3
clr.w %2
(swap %2)
(swap %3)
(clr.w %3)
(addx.l %3,%1)
(addx.l %2,%0)
The unparenthesized part is 12 bytes and 8+8+4+4 = 24 cycles
if carry set, or 10+4+4=18 cycles if carry clear.
The replacement is:
(addx.l %3,%2)
move.l %2,%3
clr.w %2
addx.w %2,%2
(swap %2)
(swap %3)
(clr.w %3)
(addx.l %3,%1)
(addx %2,%0)
Which is 6 bytes and 12 cycles.
diff --git a/longlong.h b/longlong.h
index 41ad4f980..e7b6858ca 100644
--- a/longlong.h
+++ b/longlong.h
@@ -1213,10 +1213,9 @@ extern UWtype __MPN(udiv_qrnnd) (UWtype *, UWtype, UWtype, UWtype);
" swap %2\n" \
" mulu%.w %5,%2\n" \
" add%.l %3,%2\n" \
-" jcc 1f\n" \
-" add%.l %#0x10000,%0\n" \
-"1: move%.l %2,%3\n" \
+" move%.l %2,%3\n" \
" clr%.w %2\n" \
+" addx%.w %2,%2\n" \
" swap %2\n" \
" swap %3\n" \
" clr%.w %3\n" \
The second improvement is a more minor one, to the computation
of the partial products. By tweaking the order, it's possible
to save one swap instruction.
diff --git a/longlong.h b/longlong.h
index e7b6858ca..031a092ec 100644
--- a/longlong.h
+++ b/longlong.h
@@ -1202,15 +1202,14 @@ extern UWtype __MPN(udiv_qrnnd) (UWtype *, UWtype, UWtype, UWtype);
#define umul_ppmm(xh, xl, a, b) \
do { USItype __umul_tmp1, __umul_tmp2; \
__asm__ ("| Inlined umul_ppmm\n" \
-" move%.l %5,%3\n" \
-" move%.l %2,%0\n" \
-" move%.w %3,%1\n" \
-" swap %3\n" \
+" move%.l %5,%0\n" \
+" move%.w %0,%1\n" \
" swap %0\n" \
" mulu%.w %2,%1\n" \
-" mulu%.w %3,%0\n" \
+" move%.w %0,%3\n" \
" mulu%.w %2,%3\n" \
" swap %2\n" \
+" mulu%.w %2,%0\n" \
" mulu%.w %5,%2\n" \
" add%.l %3,%2\n" \
" move%.l %2,%3\n" \
If I had a testing platform (I don't; the above is carefully
desk-checked but UNTESTED), I'd experiment with Karatsuba.
The problem is that the extra complexity really explodes the
register pressure.
More information about the gmp-devel
mailing list