[PATCH] Optimize 32-bit sparc T1 multiply routines.

Sun Jan 6 01:15:43 CET 2013

David Miller <davem at davemloft.net> writes:

  From: Torbjorn Granlund <tg at gmplib.org>
  Date: Fri, 04 Jan 2013 14:54:15 +0100

  > Did you add umulxhi use in your patch from a few days ago?

  Yes I did use mulx/umulxhi (both T3 and T4 have umulxhi) and yes the
  multiplies do pipeline on T4 (it doesn't on T3), and it gets about 4
  cycles per limb in a two-way unrolled loop in mul_1.  addmul_1 gets
  about 6.5 cycles per limb.

Could you please try my mul-only loop to determine the throughput?

It is a 2-issue pipeline, right?  So the two extra instructions for
addmul_1 compared to mul_1, if both are deeply unrolled, should cost
only about 1 + epsilon extra cycles per limb.
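
For reference, here is a minimal C sketch of what the two loops compute
per limb.  It is not the sparc assembly under discussion; the names
ref_mul_1, ref_addmul_1, limb_t and dlimb_t are made up for
illustration, and it assumes 64-bit limbs and a compiler that provides
unsigned __int128 (GCC or Clang).  The semantics follow the documented
mpn_mul_1 and mpn_addmul_1.  The only difference between the two loops
is the load of rp[i] and one more add with carry, which roughly
corresponds to the extra instructions counted above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical name for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* rp[0..n-1] = up[0..n-1] * v; returns the most significant limb. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        dlimb_t p = (dlimb_t)up[i] * v;
        limb_t lo = (limb_t)p;          /* low half, cf. mulx */
        limb_t hi = (limb_t)(p >> 64);  /* high half, cf. umulxhi */
        lo += carry;
        hi += (lo < carry);             /* cannot overflow: hi <= 2^64 - 2 */
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        dlimb_t p = (dlimb_t)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        limb_t r = rp[i];               /* extra work: load the old limb */
        lo += carry;
        hi += (lo < carry);
        lo += r;                        /* extra work: second add ... */
        hi += (lo < r);                 /* ... and its carry into hi */
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

The sum up[i]*v + carry + rp[i] is at most 2^128 - 1, so the high limb
never overflows in either loop.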

With two-way unrolling, we will get 3 extra instructions per way (5 or 6
in total per loop).  This still does not explain the slowdown from 4 c/l
to 6.5 c/l.
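
To spell out the arithmetic, here is a small sketch of the bound I am
using.  It assumes the 2-issue width holds and that issue slots are the
only limit; the function name is made up, and the model ignores
latencies, load/store ports and carry dependencies.

```c
/* Lower bound on the extra cycles per limb caused by extra
   instructions, assuming issue width is the only constraint. */
double issue_bound_extra_cpl(double extra_insns_per_limb, double issue_width)
{
    return extra_insns_per_limb / issue_width;
}
```

With 2 extra instructions per limb this gives 1.0 c/l, and with 3 it
gives 1.5 c/l, i.e. 5.5 c/l rather than the measured 6.5 c/l.  About
1 c/l remains unaccounted for by issue slots alone.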

-- 
Torbjörn