Possible new T3-T5 mul_1
davem at davemloft.net
Tue Apr 2 21:30:57 CEST 2013
From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 02 Apr 2013 21:09:51 +0200
> So what's going on in the a and b code variants? I assume the total OoO
> capacity was just not enough for a ld->mul->add 17 cycle chain scheduled
> at just 3+4 cycles. With fully scheduled loads, the OoO requirement was
> just 12 vs 4 cycles.
> I would assume that one could either choose to schedule loads >= 5
> cycles from mulx, or schedule the addxccc one or two cycles further away
> from mulx...
There can only be up to 128 instructions inside of the chip at one
time; this includes the full chip pipeline as well as the OoO buffer.
We could be running up against this.
So the thing to determine is how deep to schedule loads such that the
total number of incomplete as well as issuing instructions active at
one point does not exceed 128.
There are about 20 instructions per mul_1b.asm loop iteration that
could be queued up due to dependency resolution in the OoO unit.
And with only a 2-cycle mismatch between how the loads are scheduled
and how they'll complete, it doesn't seem like we'll generate enough
backlog to fill the chip up.
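
As a rough back-of-the-envelope check of the numbers above, here is a
small sketch. It assumes the crude model that every instruction of a
stalled iteration stays resident in the chip; the function names are
made up for illustration, and 128 and 20 are just the estimates quoted
in this mail, not measurements.

```python
# Figures quoted in this thread; both are rough estimates.
CHIP_CAPACITY = 128   # max instructions in the chip at once (pipeline + OoO buffer)
INSNS_PER_ITER = 20   # approx. instructions per mul_1b.asm loop iteration


def iterations_in_flight(capacity, insns_per_iter):
    """Whole loop iterations that fit inside the chip at one time."""
    return capacity // insns_per_iter


def backlog_fits(iters_behind, capacity=CHIP_CAPACITY,
                 insns_per_iter=INSNS_PER_ITER):
    """True if iters_behind queued iterations stay within chip capacity."""
    return iters_behind * insns_per_iter <= capacity


if __name__ == "__main__":
    print(iterations_in_flight(CHIP_CAPACITY, INSNS_PER_ITER))
    print(backlog_fits(6), backlog_fits(7))
```

Under that model about six full iterations would have to be queued
before the 128-instruction limit is reached, which is why a 2-cycle
mismatch looks too small to explain the slowdown.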
But, in any event, the mul_1a.asm startup and low count cost compared
to the existing T3 mul_1.asm is really not that bad.