T3/T4 mul_2 and addmul_2

Fri Mar 8 04:38:21 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Fri, 08 Mar 2013 04:16:25 +0100

> I only now spotted FPMADDXHI and FPMADDX.  No Sun/Oracle SPARC has been
> a floating-point demon, and these integer multiply instructions are
> performed in the FPU.

I wouldn't even bother going there.  These instructions have, at best,
an 11 cycle latency on the chips where they are supported.

They did really crazy stuff with the FPU vs INT units on T4, basically
there are "really fast" instructions that operate on the FPU registers
quickly in the front end of the chip and with 1-3 cycles of latency.

This would include things like fsrc2 (but not fsrc1 :-) to move values
around, things like movdtox/movxtod/etc. and bit logical operations like
'fxor'.

Basically everything needed to facilitate fast unaligned memcpy, and
cryptography (the AES, DES, SHA, etc. crypto instructions operate in
the FPU), those are done in the fast path.

But everything else uses the slow path FPU which has a large latency,
and does bypassing and pipelining outside of those "fast path"
instructions.

So if you mix, trying to use a fast path instruction to access the
result of a slow path instruction, you stall until the slow path FPU
instruction can return the result back to the front end of the cpu.
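
To make the cost of mixing concrete, here is a toy model in C.  It is
only a sketch: the in-order, one-issue-per-cycle machine, the struct
and function names, and the per-instruction cycle counts are all
assumptions for illustration (the 11 and 1 are taken loosely from the
figures above), not a description of the real T4 pipeline.

```c
#include <assert.h>

/* Toy model only.  Everything here is hypothetical: the cycle counts
   are illustrative assumptions, not measured T4 numbers. */
enum { MAX_OPS = 16 };

enum path { FAST_PATH, SLOW_PATH };

struct op {
    const char *name;   /* mnemonic, used as a label only */
    enum path path;     /* assumed unit; a label, not used in the math */
    int latency;        /* assumed cycles until the result is usable */
    int depends_on;     /* index of the producing op, or -1 for none */
};

/* Cycle at which the last result becomes available, assuming in-order
   issue of one op per cycle, where an op cannot start until the
   result it depends on is ready. */
static int finish_cycle(const struct op *ops, int n)
{
    int ready[MAX_OPS];
    int issue = 0, finish = 0;

    assert(n >= 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++) {
        int start = issue;
        int dep = ops[i].depends_on;

        assert(dep < i);                /* producers come first */
        if (dep >= 0 && ready[dep] > start)
            start = ready[dep];         /* stall for the producer */
        ready[i] = start + ops[i].latency;
        if (ready[i] > finish)
            finish = ready[i];
        issue = start + 1;
    }
    return finish;
}

/* Fast path op consuming a slow path result. */
int demo_mixed(void)
{
    const struct op ops[] = {
        { "fpmaddx", SLOW_PATH, 11, -1 },
        { "movdtox", FAST_PATH,  1,  0 },
    };
    return finish_cycle(ops, 2);
}

/* The same dependency shape, but staying on the fast path. */
int demo_fast_only(void)
{
    const struct op ops[] = {
        { "fxor",    FAST_PATH, 1, -1 },
        { "movdtox", FAST_PATH, 1,  0 },
    };
    return finish_cycle(ops, 2);
}
```

With these assumed numbers demo_mixed() returns 12 and demo_fast_only()
returns 2: the fast path consumer pays the whole slow path latency
before it can even start.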

Really, just stay out of the FPU.  It's a lot better in T4 than it
used to be back in the T1/T2 days, but not by much.