ARM public key benchmark
Torbjorn Granlund
tg at gmplib.org
Thu Apr 4 18:36:15 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:
I guess it's lowest numbered first (and lowest memory address).
But a loop with
use r7
ldm up!, {r4,r5,r6,r7}
use r4
looks like poor scheduling between load of r4 and use of it, and the ldm
can't be moved earlier since it clobbers r7. But I have a pretty vague
idea about how this really works.
I haven't explored the ARM chips enough to know this either.
A possible schedule is to put a stm in the ldm latency time slot:
ldm {r4-r7}
stm {r8-r11}
ldm {r8-r11}
operate on r4-r7
operate on r4-r11
The "operate on" blocks don't need to be as disjoint as the picture
seems to suggest.
Right, rewriting the loop with 3-way unrolling would be an interesting
experiment. But I don't think I'll look into that soon.
The current improvement is very good already.
It's hard to organise the ARM code. Since we have a very incomplete set
of systems, we might choose a poor asm file vector for some systems.
If your new code runs well on A15, perhaps we should assume it is good
for other systems which support umaal (>= armv6) and put it in the v6
subdir? I'll push my new addmul_1 and submul_1 to the corea15 subdir at
some point (unless your code beats it, of course).
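
For readers unfamiliar with umaal, here is a rough C model of what an
addmul_1 built around it computes. This is only a sketch, assuming
32-bit limbs; the names umaal32 and addmul_1_sketch are made up for
illustration and are not GMP's actual code, which is written in
assembly and scheduled quite differently.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the ARM umaal instruction: (*hi:*lo) = a * b + *lo + *hi.
   The 64-bit sum cannot overflow, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal32(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical stand-in for mpn_addmul_1 with 32-bit limbs:
   {rp,n} += {up,n} * v, returning the carry-out limb. */
static uint32_t addmul_1_sketch(uint32_t *rp, const uint32_t *up,
                                size_t n, uint32_t v)
{
    uint32_t cy = 0;                  /* running carry limb */
    for (size_t i = 0; i < n; i++) {
        uint32_t lo = rp[i];
        umaal32(&lo, &cy, up[i], v);  /* lo gets the result limb, cy the new carry */
        rp[i] = lo;
    }
    return cy;
}
```

The point of the instruction is that the product, the old rp limb and
the incoming carry are all accumulated in one step, so no separate
carry-propagating adds are needed in the loop.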
--
Torbjörn