Some secondary asm T3,T4,T5 functions

David Miller davem at
Thu Apr 4 20:21:29 CEST 2013

From: Torbjorn Granlund <tg at>
Date: Thu, 04 Apr 2013 12:50:04 +0200

> David Miller <davem at> writes:
>
>   Attached is a dive_1.asm that works for me on real hardware as
>   well as T4 timings from:
>
>   tune/speed -p10000000 -s1-1000 -f1.1 -C mpn_divexact_1.3
>
> This timing is most curious.  The cost of inversion computation should
> be clearly visible for tiny sizes; I'd expect more than 43 cycles there.
> There are 6 chained mulx instructions for the inversion!
>
> And then performance hits about 20 c/l early, only to drop to something
> much worse.
>
> Have you seen anything similar for other routines?
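
For reference, the dependency chain behind the "6 chained mulx" remark
can be sketched in C.  This is only an illustration, assuming the usual
Newton iteration for an inverse modulo 2^64 seeded with 8 correct bits;
it is not the attached dive_1.asm, and the names seed8 and binvert64
are hypothetical.  Each of the three steps has two dependent
multiplies (inv*inv, then the product times d), which gives six
multiplies in series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: inverse of an odd d modulo 2^8.
   Stands in for the small lookup table a real routine would
   typically use for its seed. */
static uint64_t seed8(uint64_t d)
{
    uint64_t x = d;      /* d*d == 1 (mod 8), so x is correct to 3 bits */
    x *= 2 - d * x;      /* now correct to 6 bits */
    x *= 2 - d * x;      /* now correct to 12 bits */
    return x & 0xff;     /* keep the low 8 bits as the seed */
}

/* Hypothetical sketch: inverse of an odd d modulo 2^64.
   Each Newton step doubles the number of correct low bits and
   contains two multiplies that depend on each other, so the three
   steps form a chain of six dependent multiplies. */
static uint64_t binvert64(uint64_t d)
{
    assert(d & 1);                       /* only odd d is invertible */
    uint64_t inv = seed8(d);             /*  8 bits */
    inv = 2 * inv - inv * inv * d;       /* 16 bits */
    inv = 2 * inv - inv * inv * d;       /* 32 bits */
    inv = 2 * inv - inv * inv * d;       /* 64 bits */
    return inv;
}
```

With a multiply latency of L cycles, a chain like this costs at least
6*L cycles before the first limb can be produced, which is why the
inversion is expected to dominate at tiny sizes.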

We've seen this for cases where we were re-issuing multiplies within 12
cycles of the last iteration.

I suspect these are simply OoO unit artifacts.

I wonder if CPU cycle counter reads really flush the full pipeline, or
whether OoO execution can overlap part of it.

But even so, the numbers look very strange to me too.  I reran it just
now to make sure I reported it correctly, and I still get the same
