Some arm cortex-a8 improvements

Mon Apr 23 16:49:55 CEST 2012

Richard Henderson <rth at twiddle.net> writes:

  Indeed I know that the hw registers that allow such recognition
  are all privileged.  For Linux the best one can do is /proc/cpuinfo
  or (to some extent) the values in AT_HWCAP.

Something portable would be nice...
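
For reference, the AT_HWCAP route could look roughly like the sketch
below.  It is Linux-only and assumes a libc that provides getauxval()
(glibc 2.16 or later).  The helper names and the NEON mask are my own
illustration: bit 12 is what 32-bit ARM Linux uses for HWCAP_NEON, and
on any other architecture that bit means something else.  As noted, this
only tells you about features, not which core you are running on.

```c
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP; Linux-specific */

/* Assumption: bit 12 of AT_HWCAP is HWCAP_NEON on 32-bit ARM Linux.
   The bit has a different meaning on other architectures. */
#define ARM_HWCAP_NEON_ASSUMED (1UL << 12)

/* Return 1 if every bit of mask is set in hwcap, else 0. */
int hwcap_has(unsigned long hwcap, unsigned long mask)
{
    return (hwcap & mask) == mask;
}

/* Fetch the hardware capability word the kernel passed to this process. */
unsigned long query_hwcap(void)
{
    return getauxval(AT_HWCAP);
}

/* Hypothetical helper: print whether the (assumed) NEON bit is set. */
void report_neon(void)
{
    unsigned long hwcap = query_hwcap();
    printf("AT_HWCAP = 0x%lx, NEON bit %s\n", hwcap,
           hwcap_has(hwcap, ARM_HWCAP_NEON_ASSUMED) ? "set" : "clear");
}
```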

  FYI, I dug out the add/mul_2.asm files I was working on in February.
  IIRC, they're correct as in they pass the testsuite, but I could not
  show them to be faster than the add/mul_1 paths.

Do you know the repeat rate of umull, umlal, umaal, assuming no reg
dependencies?

Usually, it is possible to come close to the mul throughput for some
addmul_N, N >= 1.

Forget mul_2, go for addmul_2, since the latter will be used repeatedly
from mul_basecase or sqr_basecase.

One will want to do latency scheduling for umaal, handling v0 (the low
limb of the 2-limb v operand) and v1 (its high limb) semi-separately; one
first multiplies with v0, then limping along some cycles later,
multiply-adds v1.  I suppose umaal + umaal or umaal + umlal would both
work.
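
To make the data flow concrete, here is a rough C model of the
umaal + umaal variant with 32-bit limbs.  It is only a sketch: umaal()
below models the instruction (hi:lo = a*b + lo + hi, which cannot
overflow 64 bits), and addmul_2_sketch() with its calling convention
(add to {rp,n}, store limb n, return limb n+1, n >= 1) is a name and
interface I made up for illustration, not the real asm entry point.
Real code would also software-pipeline the two chains further apart;
here the v1 chain trails the v0 chain by just one limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of ARM umaal: (*hi:*lo) = a * b + *lo + *hi.
   The maximum value is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so no overflow. */
void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t) a * b + *lo + *hi;
    *lo = (uint32_t) t;
    *hi = (uint32_t) (t >> 32);
}

/* Hypothetical interface: {rp,n} += {up,n} * {vp,2}; the next limb is
   stored at rp[n] and the most significant limb is returned.
   Requires n >= 1. */
uint32_t addmul_2_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                         const uint32_t *vp)
{
    uint32_t v0 = vp[0], v1 = vp[1];
    uint32_t c0 = 0, c1 = 0;        /* carry limbs of the v0 and v1 chains */
    uint32_t lo;
    size_t i;

    /* Column 0 only receives a contribution from v0. */
    lo = rp[0];
    umaal(&lo, &c0, up[0], v0);
    rp[0] = lo;

    for (i = 1; i < n; i++) {
        lo = rp[i];
        umaal(&lo, &c0, up[i], v0);      /* v0 chain leads */
        umaal(&lo, &c1, up[i - 1], v1);  /* v1 chain trails by one limb */
        rp[i] = lo;
    }

    /* Column n: leftover v0 carry plus the last v1 product. */
    lo = c0;
    umaal(&lo, &c1, up[n - 1], v1);
    rp[n] = lo;
    return c1;
}
```

Each limb of u costs two multiply instructions and no separate carry
handling, which is why umaal looks attractive for this loop.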

-- 
Torbjörn