[PATCH] T3/T4 sparc shifts, plus more timings

David Miller davem at davemloft.net
Thu Mar 28 18:55:28 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Thu, 28 Mar 2013 10:26:51 +0100

> I believe 4-way addmul_2 could be made to run at 3.25 c/l, while mul_2
> could run at 2.25 c/l.  And while GMP's configure will now have made
> addmul_2 the main workhorse for mul_basecase and sqr_basecase, we should
> also improve addmul_1.  A 4-way addmul_1 will need 31 insn and should
> run at 4 c/l.  A 4-way mul_1 will need 22 insns, or 2.27 c/l.

The TODO list grows, it never seems to shrink :-)

Wrt. scheduling mulx/umulxhi, I think that to a certain extent the
out-of-order completion unit in the backend of the chip takes care of
most things.

For example, I'm pretty sure that if you have a mulx feeding an
immediate store, the store just gets queued up in the out-of-order
completion queue waiting for the mulx to finish.  The store address
gets evaluated in the pipeline stages, the TLB lookup etc. occurs in
the next stage, and then the store just waits for the data before
inserting into the store buffer.
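
To make the multiply-to-store dependency concrete, here is a rough C
sketch of the kind of loop a mul_1 routine implements.  This is only an
illustration, not GMP's actual code: the name ref_mul_1 is made up, and
unsigned __int128 (a GCC/Clang extension) stands in for the mulx/umulxhi
pair.  The point is simply that the store address (rp + i) does not
depend on the multiply, only the stored data does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop, not GMP's mpn_mul_1 source.
   Writes the low n limbs of up[0..n-1] * v to rp and returns
   the carry-out limb. */
uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* unsigned __int128 is a GCC/Clang extension (assumption). */
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        uint64_t lo = (uint64_t)p;          /* low half, as mulx gives */
        uint64_t hi = (uint64_t)(p >> 64);  /* high half, as umulxhi gives */

        lo += carry;
        hi += (lo < carry);  /* cannot overflow: hi is at most 2^64 - 2 */

        /* The address rp + i is known early; only the data waits
           on the multiply and the carry add. */
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

In a loop like this the address side of each store can be worked out
well ahead of the data, which is the situation described above.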
