Neon addmul_8

Sun Feb 24 19:39:46 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  Yeah, but I use up the entire multiplication portion doing the integer
  bookkeeping for the round.  I want to get that done asap so that the
  load instructions for the next round are issued as early as
  possible. And as far as I know, no ARM pipeline does more than dual
  issue.

Isn't A15 3-issue?

  > The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c.

  I see.  And any amount of adding under the toom limit is reasonable?

Sorry, I don't understand.

  I've started work on a mul_basecase.asm.  I'm calling out to the
  normal routines for mul_[12] and addmul_2.  I've got internal
  primitives for k=14, 8, and 6, which all share some code.  This gives:

  	31:	1+14+14+2
  	30:	2+14+14
  	29:	1+14+14
  	...
  	14:	2+6+6
  	13:	1+6+6
  	12:	2+8+2
  	11:	1+8+1
  	10:	2+8
  	 9:	1+8
  	 8:	2+6
  	 7:	1+6
  	 6:	2+2+2
  	 5:	1+2+2
  	...

  It will loop on the 14, so it can still be invoked above the toom limit,
  but I also like the table at the end to handle the last 12 so that we
  get 6+6 instead of 8+2+2.

An addmul_14?  Zany.  :-)

There is a technique which is extremely important in mul_basecase, in
particular when basing it on deeply sw pipelined addmuls.  I call it
overlapping software pipelining, or OSP...

Let's start by identifying some building blocks.

IL = Inner loop
FI = Feed-in code for IL
WD = Wind-down code for IL

Grapically we could do something like this:

|\
| \
|  \
|FI \
|____\
|     |
|     |
| IL  |
|     |
|_____|
 \    |
  \WD |
   \  |
    \ |
     \|

The IL block will use execution resourses well.  FI will start with
loads, then after some cycles perform multiplies, then several cycles
after that start accumulating things.  The WD block will make no loads,
but in teh beginning perform a mix of old accumulations and new
multiplies, and later just perform accumulations, and finally just
stores.

For mul_basecase we can overlap the previous iteration WD with the new
iteration FI, which will save many, many cycles.

|\
| \
|  \
|FI \
|____\
|     |
|     |
| IL  |
|     |
|_____|
|\    |
| \WD |
|  \  |
|FI \ |
|____\|
|     |
|     |
| IL  |
|     |
|_____|
|\    |
| \WD |
|  \  |
|FI \ |
|____\|
   .
   .
   .
 _____
 \    |
  \WD |
   \  |
    \ |
     \|

OK, so perhaps there in one way of beating your code.  Let's consider
vertical summation for a while!

If we form product sums like

S_i = u_{k-1} * v_{0} + u_{k-2} * v{1} + ... + u_{0} * v{k-1}

akin to a convolution, we will (as long as k is kept < 2^b for a b-bit
word size) will form a 3-word sum S_i.  The word with (relative)
significance 1 is to be written "here", the word with significance 2^b
is to be sent as carry-in to the next sum, S_{i+1}, and the word with
significance 2^{2b} is to be sent as carry in the sum S_{i+2}.

-- 
Torbjörn