tg at gmplib.org
Tue Feb 26 15:27:51 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
Hmm, I tried changing all output registers to unique registers (only
written once in the loop, never ever read, except as vmlal reads the
output register before accumulating to it). Do you mean that I need to
change the *input* registers of all instructions too?
Not if you manage to break all dependencies without doing that. Beating
on just one input reg frees up every other register for the write-once
For simplicity, I did this to my addmul_4 loop. Cycle time dropped from
4.5 c/l to 4. I.e., 16 cycles needed to execute a loop consisting of 11
almost completely independent instructions,
2 ld1 (scalar)
1 st1 (scalar)
Then it's clear some execution unit is not able to keep up with
instruction decoding at all, right? I then try taking out some
instructions (but keeping load, store and looping):
no arithmetic:  1.75 c/l  (only load, store, loop overhead)
vmlal only:     2.75 c/l  (also the same with vmull instead of vmlal)
vext only:      2.5  c/l
vext+vmlal:     3.5  c/l
vpaddl only:    2.0  c/l
vpaddl+vext:    3.0  c/l
vpadd+vmlal:    3.0  c/l
all:            4.0  c/l
What conclusions can one draw from this exercise? It seems that vext and
vmlal compete for execution resources, while vpaddl can be done mostly
in parallel with the other operations.
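One way to check this reading is to subtract the 1.75 c/l baseline from each measurement and compare every combination with the sum of its parts. The sketch below only redoes the arithmetic on the numbers reported above. The additive-cost model is an assumption, as is treating the "vpadd+vmlal" row as vpaddl+vmlal; the names `measured`, `extra_cost` and `excess_over_sum` are made up for illustration.

```python
# Cycles per limb (c/l) as reported in the message, keyed by which
# arithmetic instructions were left in the loop.
measured = {
    frozenset(): 1.75,                            # load, store, loop overhead
    frozenset({"vmlal"}): 2.75,
    frozenset({"vext"}): 2.5,
    frozenset({"vext", "vmlal"}): 3.5,
    frozenset({"vpaddl"}): 2.0,
    frozenset({"vpaddl", "vext"}): 3.0,
    frozenset({"vpaddl", "vmlal"}): 3.0,          # assumption: listed as "vpadd+vmlal"
    frozenset({"vpaddl", "vext", "vmlal"}): 4.0,  # the full loop
}

BASELINE = measured[frozenset()]


def extra_cost(mix):
    """c/l added on top of the no-arithmetic baseline by this instruction mix."""
    return measured[frozenset(mix)] - BASELINE


def excess_over_sum(mix):
    """Combined extra cost minus the sum of the individual extra costs.

    Under the (assumed) additive model, zero means the instructions do not
    overlap at all, and a negative value would mean they run in parallel.
    """
    return extra_cost(mix) - sum(extra_cost({name}) for name in mix)


if __name__ == "__main__":
    for mix in sorted(measured, key=lambda m: (len(m), sorted(m))):
        if mix:
            label = "+".join(sorted(mix))
            print(f"{label:20s} extra {extra_cost(mix):.2f} c/l, "
                  f"excess over sum {excess_over_sum(mix):+.2f}")
```

On these numbers vpaddl alone adds only 0.25 c/l, while vext and vmlal add 0.75 and 1.0 c/l, and vext+vmlal costs exactly the sum of the two.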
That's an important conclusion. Perhaps one should avoid vext? (But
since these experiments were done on A9, we shouldn't make that
conclusion at all).
If vext is bad also for A15, I'd hope to use the 32 + 64 -> 64 add
instruction for all (k'ish) column summations, and only when a column is
about to get ready, add previous carry-in, and then shuffle as needed
with vext. That might reduce things to one vext per iteration.
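The deferred-carry idea can be modelled in scalar code. What follows is a minimal sketch under assumptions that are not from this thread, and it is not the NEON loop being discussed: limbs are 32 bits, each 64-bit product is split into two 32-bit halves, each half is added into a 64-bit column accumulator (modelling the 32 + 64 -> 64 add), and the carry from the previous column is added only once a column is complete. The names `widening_add`, `mul_by_columns` and `from_limbs` are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def widening_add(acc64, x32):
    """Model of a 32 + 64 -> 64 bit add; checks that the accumulator fits."""
    assert 0 <= x32 <= MASK32
    total = acc64 + x32
    assert total <= MASK64, "column accumulator overflowed 64 bits"
    return total


def mul_by_columns(u, v):
    """Multiply two limb lists (32-bit limbs, least significant first).

    Phase 1 sums product halves per column without any carry propagation.
    Phase 2 visits each finished column once, adds the carry-in from the
    previous column, emits the low 32 bits and passes the rest on.
    """
    n, k = len(u), len(v)
    cols = [0] * (n + k)
    for i in range(n):
        for j in range(k):
            p = u[i] * v[j]                      # 32 x 32 -> 64 bit product
            cols[i + j] = widening_add(cols[i + j], p & MASK32)
            cols[i + j + 1] = widening_add(cols[i + j + 1], p >> 32)

    result, carry = [], 0
    for column_sum in cols:
        t = column_sum + carry                   # carry-in added only here
        assert t <= MASK64
        result.append(t & MASK32)
        carry = t >> 32
    assert carry == 0                            # product fits in n + k limbs
    return result


def from_limbs(limbs):
    """Convert a least-significant-first limb list to a Python integer."""
    return sum(limb << (32 * idx) for idx, limb in enumerate(limbs))


if __name__ == "__main__":
    u = [0xFFFFFFFF] * 6
    v = [0xFFFFFFFF] * 4                         # k = 4, as in addmul_4
    assert from_limbs(mul_by_columns(u, v)) == from_limbs(u) * from_limbs(v)
```

Each column receives at most 2k values below 2^32, so for small k the 64-bit accumulators have ample headroom, and carry handling is needed only once per column.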
Your vmlal performance seems strange. I can run one vmlal every 2nd
cycle on my A9, i.e., a c/l of throughput.