Retrocomputing fun: Improve m68000 umul_ppmm

David Sparks sparks05 at proton.me
Tue Feb 10 15:37:22 CET 2026


I was looking through the silicon archaeology in longlong.h (i860!  Fairchild Clipper!)
when I noticed some low-hanging fruit in a place I didn't expect.

There are two separate obvious improvements that can be made to the
68000-without-MULU.L code path.

The first is to the summation of partial products.  Summing
the middle products (ah*bl + al*bh) can produce a carry, but
there's no need for a conditional branch and large immediate to
handle it.

Old code (unchanged context in parens):
	(addx.l	%3,%2)
	jcc	1f
	add.l	%#0x10000,%0
1:	move.l	%2,%3
	clr.w	%2
	(swap	%2)
	(swap	%3)
	(clr.w	%3)
	(addx.l	%3,%1)
	(addx.l	%2,%0)
    
The unparenthesized part is 12 bytes and 8+8+4+4 = 24 cycles
if carry set, or 10+4+4=18 cycles if carry clear.
    
The replacement is:
	(addx.l	%3,%2)
	move.l	%2,%3
	clr.w	%2
	addx.w	%2,%2
	(swap	%2)
	(swap	%3)
	(clr.w	%3)
	(addx.l	%3,%1)
	(addx	%2,%0)
Which is 6 bytes and 12 cycles.

diff --git a/longlong.h b/longlong.h
index 41ad4f980..e7b6858ca 100644
--- a/longlong.h
+++ b/longlong.h
@@ -1213,10 +1213,9 @@ extern UWtype __MPN(udiv_qrnnd) (UWtype *, UWtype, UWtype, UWtype);
 "	swap	%2\n"							\
 "	mulu%.w	%5,%2\n"						\
 "	add%.l	%3,%2\n"						\
-"	jcc	1f\n"							\
-"	add%.l	%#0x10000,%0\n"						\
-"1:	move%.l	%2,%3\n"						\
+"	move%.l	%2,%3\n"						\
 "	clr%.w	%2\n"							\
+"	addx%.w	%2,%2\n"						\
 "	swap	%2\n"							\
 "	swap	%3\n"							\
 "	clr%.w	%3\n"							\


The second improvement is a more minor one, to the computation
of the partial products.  By tweaking the order, it's possible
to save one swap instruction.

diff --git a/longlong.h b/longlong.h
index e7b6858ca..031a092ec 100644
--- a/longlong.h
+++ b/longlong.h
@@ -1202,15 +1202,14 @@ extern UWtype __MPN(udiv_qrnnd) (UWtype *, UWtype, UWtype, UWtype);
 #define umul_ppmm(xh, xl, a, b) \
   do { USItype __umul_tmp1, __umul_tmp2;				\
 	__asm__ ("| Inlined umul_ppmm\n"				\
-"	move%.l	%5,%3\n"						\
-"	move%.l	%2,%0\n"						\
-"	move%.w	%3,%1\n"						\
-"	swap	%3\n"							\
+"	move%.l	%5,%0\n"						\
+"	move%.w	%0,%1\n"						\
 "	swap	%0\n"							\
 "	mulu%.w	%2,%1\n"						\
-"	mulu%.w	%3,%0\n"						\
+"	move%.w	%0,%3\n"						\
 "	mulu%.w	%2,%3\n"						\
 "	swap	%2\n"							\
+"	mulu%.w	%2,%0\n"						\
 "	mulu%.w	%5,%2\n"						\
 "	add%.l	%3,%2\n"						\
 "	move%.l	%2,%3\n"						\

If I had a testing platform (I don't; the above is carefully
desk-checked but UNTESTED), I'd experiment with Karatsuba.
The problem is that the extra complexity really explodes the
register pressure.


More information about the gmp-devel mailing list