[PATCH] Optimize 32-bit sparc T1 multiply routines.
tg at gmplib.org
Sun Jan 6 01:15:43 CET 2013
David Miller <davem at davemloft.net> writes:
From: Torbjorn Granlund <tg at gmplib.org>
Date: Fri, 04 Jan 2013 14:54:15 +0100
> Did you add umulxhi use in your patch from a few days ago?
Yes I did use mulx/umulxhi (both T3 and T4 have umulxhi) and yes the
multiplies do pipeline on T4 (they don't on T3), and it gets about 4
cycles per limb in a two-way unrolled loop in mul_1. addmul_1 gets
about 6.5 cycles per limb.
Could you please try my mul-only loop to determine the throughput?
It is a 2-issue pipeline, right? So the two extra instructions for
addmul_1 compared to mul_1, if both are deeply unrolled, should allow for
a differential of 1 + epsilon cycles.
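
For reference, here is a minimal C sketch of what the two loops compute,
to make the extra per-limb work in addmul_1 visible. It is not the GMP
source or the sparc assembly under discussion: the names limb_t,
ref_mul_1 and ref_addmul_1 are hypothetical, 64-bit limbs are assumed,
and unsigned __int128 (a GCC/Clang extension) stands in for the
mulx/umulxhi pair. How the marked extra steps map to instructions
depends on the actual assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a 64-bit limb */

/* rp[] = up[] * v; returns the carry-out limb. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n > 0);
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;          /* low half, as mulx would give */
        limb_t hi = (limb_t)(p >> 64);  /* high half, as umulxhi would give */
        lo += cy;
        hi += (lo < cy);                /* carry out of the low-limb add */
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}

/* rp[] += up[] * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n > 0);
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        lo += cy;
        hi += (lo < cy);
        limb_t r = rp[i];               /* extra: load the destination limb */
        lo += r;                        /* extra: accumulate into the product */
        hi += (lo < r);                 /* extra: propagate that carry */
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}
```

The only difference between the two loop bodies is the load of rp[i],
the add, and the carry propagation marked "extra" above.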
With two-way unrolling, we will get 3 extra instructions per way (5 or 6
in total per loop). This still does not explain the slowdown from 4 c/l
to 6.5 c/l.
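
To spell out the arithmetic: if the only cost of the extra instructions
is issue slots, the bound is extra instructions divided by issue width,
divided by limbs per loop iteration. The helper below is a hypothetical
sketch of that estimate and assumes perfect dual issue with no latency
or dependency stalls. With 5 or 6 extra instructions in a two-limb loop
it gives 1.25 to 1.5 c/l, so about 5.25 to 5.5 c/l would be expected for
addmul_1, not the measured 6.5 c/l.

```c
#include <assert.h>

/* Issue-bound cost, in cycles per limb, of extra instructions in a loop.
   Assumes every issue slot can be filled (an idealization). */
double extra_cycles_per_limb(int extra_insns_per_loop,
                             int issue_width,
                             int limbs_per_loop)
{
    assert(issue_width > 0 && limbs_per_loop > 0);
    return (double)extra_insns_per_loop / issue_width / limbs_per_loop;
}
```

For example, extra_cycles_per_limb(6, 2, 2) evaluates to 1.5, and the
deeply unrolled case with two extra instructions per limb,
extra_cycles_per_limb(2, 2, 1), evaluates to 1.0.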