ARM public key benchmark

Thu Apr 4 18:36:15 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  I guess it's lowest numbered first (and lowest memory address).

  But a loop with

    use	r7  
    ldm	up!, {r4,r5,r6,r7}
    use	r4

  looks like poor scheduling between load of r4 and use of it, and the ldm
  can't be moved earlier since it clobbers r7. But I have a pretty vague
  idea about how this really works.

I haven't explored the ARM chips enough to know this either.

A possible schedule is to put a stm in the ldm latency time slot:

   ldm {r4-r7}
   stm {r8-r11}
   ldm {r8-r11}
   operate on r4-r7
   operate on r4-r11

The "operate on" blocks don't need to be as disjoint as the picture
seems to suggest.

  Right, rewriting the loop with 3-way unrolling would be an interesting
  experiment. But I don't think I'll look into that soon.

The current improvement is very good already.

It's hard to organise the ARM code.  Since we have a very incomplete set
of systems, we might choose a poor asm file vector for some systems.

If your new code runs well on A15, perhaps we should assume it is good
for other systems which support umaal (>= armv6) and put it in the v6
subdir?  I'll push my new addmul_1 and submul_1 to the corea15 subdir at
some point (unless your code beats it, of course).
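
For reference, here is a minimal C sketch of what a umaal-based addmul_1
computes, assuming 32-bit limbs.  The names umaal_model and ref_addmul_1
are made up for illustration and are not the actual asm or any GMP entry
point.  It only models the arithmetic (umaal computes hi:lo = a*b + lo +
hi, which always fits in 64 bits) and says nothing about scheduling:

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the ARM umaal instruction: (hi:lo) = a * b + lo + hi.
   The sum cannot overflow 64 bits, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical reference for addmul_1 with 32-bit limbs:
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb.
   One umaal per limb handles the multiply, the add of rp[i],
   and the add of the incoming carry. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n,
                             uint32_t v)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t lo = rp[i];
        umaal_model(&lo, &cy, up[i], v);
        rp[i] = lo;
    }
    return cy;
}
```

Since each limb needs just one umaal plus a load and a store, the loop
speed is probably decided mostly by how well the ldm/stm traffic is
scheduled around the multiplies.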

-- 
Torbjörn