Some arm cortex-a8 improvements

Tue Apr 24 17:07:04 CEST 2012

Richard Henderson <rth at twiddle.net> writes:

  > On my system, umaal has a latency of 3, whatever dependencies I create.
  > (There are 4 input regs and 2 output, so there are quite a few
  > possible dependency combinations; I only tried a subset.)
  > 
  > Either the docs are plain wrong, or there are several variants of A9.

  Dunno.  It's at this point that I'd try asking one of the arm
  guys from the gcc list and see if they can get an answer from
  somewhere inside the company.

I suppose I need to do that, to get the GMP arm code into really good
shape.  The ARM landscape is completely new to me, and it seems at least
as vast as the x86 landscape.

I pushed an addmul_2 running at 2.38 cycles per limb product.  It is
trivial to make it run at 2.25 c/l, at the expense of using more callee-
saves registers.

Again, the inner loop uses no explicit add instructions, just umaal pairs.
Numbers speak louder than words: this code disproves ARM's cycle numbers,
as it could not run at this speed if umaal had a latency of 5 cycles.
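
For readers who do not know umaal, here is a rough C model of how a pair
of umaal operations per limb can replace explicit adds in an addmul_2
loop.  This is only a sketch, not the assembly I pushed: the function
names, the 32-bit limb type and the exact calling convention (limb n is
stored at rp[n], limb n+1 is returned) are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of ARM's UMAAL: {*hi,*lo} = a * b + *lo + *hi.
   The sum always fits in 64 bits, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical reference addmul_2: {rp,n} += {up,n} * {vp,2}.
   Stores limb n of the result at rp[n] (overwriting it) and
   returns limb n+1.  rp must have room for n+1 limbs. */
uint32_t addmul_2_model(uint32_t *rp, const uint32_t *up, size_t n,
                        const uint32_t *vp)
{
    assert(n >= 1);
    uint32_t v0 = vp[0], v1 = vp[1];
    uint32_t c0 = 0, c1 = 0;   /* running carry limbs of the v0 and v1 rows */
    uint32_t x = rp[0];

    umaal(&x, &c0, up[0], v0); /* position 0 only gets a v0 term */
    rp[0] = x;

    for (size_t i = 1; i < n; i++) {
        x = rp[i];
        umaal(&x, &c0, up[i], v0);      /* x,c0 = up[i]*v0   + x + c0 */
        umaal(&x, &c1, up[i - 1], v1);  /* x,c1 = up[i-1]*v1 + x + c1 */
        rp[i] = x;
    }

    x = c0;                             /* position n: both carries meet */
    umaal(&x, &c1, up[n - 1], v1);
    rp[n] = x;
    return c1;
}
```

Each limb position costs two umaal operations and no separate add or
carry handling, since each umaal absorbs both the accumulator limb and
the carry limb of its row.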

I expect to reach 2 c/l for addmul_3 or addmul_4.  Unlike other fast
processors, A9 doesn't make fast code huge, so actually implementing
addmul_3/4 seems quite reasonable.

Next thing to do is to write mul_basecase, sqr_basecase, mullo_basecase.
These just reuse the addmul_2 and addmul_1 loops, but doing so saves a
lot of overhead (i.e., it decreases the constant of the O(n) term).
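
To show the structure I have in mind, here is a plain C sketch of a
mul_basecase built from addmul_2 rows plus one addmul_1 row when the
second operand has an odd number of limbs.  Everything in it is
hypothetical (names, 32-bit limbs, zeroing the result first, and an
addmul_2 written as two addmul_1 passes for brevity).  The real
assembly would instead place the loops inline, which is where the saved
per-row overhead comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb;

/* {rp,n} += {up,n} * v; returns the carry-out limb. */
static limb addmul_1(limb *rp, const limb *up, size_t n, limb v)
{
    limb cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* up[i]*v + rp[i] + cy cannot overflow 64 bits */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb)t;
        cy = (limb)(t >> 32);
    }
    return cy;
}

/* {rp,n} += {up,n} * {vp,2}; stores limb n at rp[n], returns limb n+1.
   Two addmul_1 passes here; a tuned addmul_2 would fuse the rows. */
static limb addmul_2(limb *rp, const limb *up, size_t n, const limb *vp)
{
    rp[n] = addmul_1(rp, up, n, vp[0]);
    return addmul_1(rp + 1, up, n, vp[1]);
}

/* {rp,un+vn} = {up,un} * {vp,vn}, schoolbook style. */
void mul_basecase_sketch(limb *rp, const limb *up, size_t un,
                         const limb *vp, size_t vn)
{
    assert(un >= 1 && vn >= 1);
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;

    size_t j = 0;
    for (; j + 2 <= vn; j += 2)          /* two limbs of v per row */
        rp[j + un + 1] = addmul_2(rp + j, up, un, vp + j);
    if (j < vn)                          /* leftover limb when vn is odd */
        rp[j + un] = addmul_1(rp + j, up, un, vp[j]);
}
```

The fixed cost of each row (setup, loads of the v limbs, loop entry and
exit) is paid about vn/2 times, so it contributes only to the linear
term of the running time, while the umaal work makes up the quadratic
term.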

-- 
Torbjörn