Neon addmul_8

Sun Feb 24 19:13:56 CET 2013

On 2013-02-24 04:24, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
>    gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=1694000000
>    $ ./t.out
>    mpn_addmul_8:     2845ms (1.782 cycles/limb) [973.59 Gb/s]
>    mpn_addmul_8:     2620ms (1.641 cycles/limb) [1057.20 Gb/s]
>    mpn_addmul_8:     2625ms (1.644 cycles/limb) [1055.19 Gb/s]
>    mpn_addmul_8:     2625ms (1.644 cycles/limb) [1055.19 Gb/s]
>    mpn_addmul_8:     2625ms (1.644 cycles/limb) [1055.19 Gb/s]
>
> Wow!  You did it!
>
>    (Clearly there's something wrong with the bandwidth calculation.  ;-)
>
> The "bandwidth" is the number of one-bit multiplies performed.  It makes
> m-bit addmul and an n-bit addmul (m=32 and n=64, for example)
> comparable.

Ah, I thought it was some measure of the memory that must have been read 
and written in the process.
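
For the record, the figure seems to be the limb-product rate times the
square of the limb width. The sketch below is my reading of that
definition, not the actual benchmark code: the name bitmul_rate is
hypothetical, and the 32-bit limb is an assumption (it reproduces the
numbers quoted above to within rounding).

```c
#include <assert.h>

/* Hypothetical helper: one-bit multiplies per second.
   limb_bits        limb width in bits (assumed 32 for this ARM target)
   clock_hz         CPU clock, the value passed as -DCLOCK
   cycles_per_limb  measured cycles per limb product */
double bitmul_rate(unsigned limb_bits, double clock_hz, double cycles_per_limb)
{
    assert(cycles_per_limb > 0.0);

    /* Limb-by-limb products completed per second. */
    double limb_products_per_sec = clock_hz / cycles_per_limb;

    /* Each limb product counts as limb_bits * limb_bits one-bit multiplies. */
    return limb_products_per_sec * (double)limb_bits * (double)limb_bits;
}
```

With limb_bits=32, clock_hz=1694000000 and cycles_per_limb=1.644 this
lands near 1055e9, in line with the "Gb/s" column.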

> It is a shame those vext insns are needed.  With all those neon addition
> instruction variants, I would have thought vext + vpaddl could be done
> using one insn...

Yeah, but the columns are all off by one.  Getting those lined up coming 
out of the multiplication would require just as much fiddling.

> One possible weakness of your code is that it separates multiplication
> and addition.  Many pipelines prefer a more balanced load.

Yeah, but I use up the entire multiplication portion doing the integer 
bookkeeping for the round.  I want to get that done asap so that the 
load instructions for the next round are issued as early as possible. 
And as far as I know, no ARM pipeline does more than dual issue.

> I realise that changing that might be easier said than done.  Perhaps
> the last vpaddl could be folded into the beginning of the loop, and
> perhaps the first vext could be folded into the last vmlals?

Possibly...

>    What sizes are really interesting for addmul decomposition?
>
> Do you mean k, as in addmul_k?  The smaller k we can find, the better.
> Any addmul_k for k > 1 is a compromise, sort of.

Fair enough.

> The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c.

I see.  And any amount of adding under the toom limit is reasonable?

I've started work on a mul_basecase.asm.  I'm calling out to the normal 
routines for mul_[12] and addmul_2.  I've got internal primitives for 
k=14, 8, and 6, which all share some code.  This gives:

	31:	1+14+14+2
	30:	2+14+14
	29:	1+14+14
	...
	14:	2+6+6
	13:	1+6+6
	12:	2+8+2
	11:	1+8+2
	10:	2+8
	 9:	1+8
	 8:	2+6
	 7:	1+6
	 6:	2+2+2
	 5:	1+2+2
	...

It will loop on the 14, so it can still be invoked above the toom limit,
but I also like the table at the end to handle the last 12 so that we 
get 6+6 instead of 8+2+2.
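
Roughly, the selection logic can be sketched in C as below. This is
only an illustration, not the asm: the function name decompose and the
tail table layout are hypothetical, and the rows elided with "..."
above are filled in by assumption (loop on 14 while at least 14
remain, then look up the even remainder). It agrees with listed rows
such as 31, 14, 12 and 7.

```c
#include <assert.h>

/* Hypothetical helper: split size n >= 1 into the widths of the
   primitives to run.  Writes the widths to out[] and returns how many
   were written; out[] needs room for 3 + n/14 entries. */
int decompose(int n, int *out)
{
    /* Tail table for the even remainder r in 0..12, indexed by r/2.
       Rows are zero-terminated.  Note 12 -> 6+6 rather than 8+2+2. */
    static const int tail[7][3] = {
        {0, 0, 0},  /* r = 0  */
        {2, 0, 0},  /* r = 2  */
        {2, 2, 0},  /* r = 4  */
        {6, 0, 0},  /* r = 6  */
        {8, 0, 0},  /* r = 8  */
        {8, 2, 0},  /* r = 10 */
        {6, 6, 0},  /* r = 12 */
    };
    int cnt = 0;

    assert(n >= 1);

    /* First row: mul_1 for odd n, mul_2 for even n, leaving r even. */
    int first = (n & 1) ? 1 : 2;
    out[cnt++] = first;
    int r = n - first;

    /* Loop on the 14-wide primitive (assumed condition: r >= 14). */
    while (r >= 14) {
        out[cnt++] = 14;
        r -= 14;
    }

    /* Finish the remaining 0..12 from the table. */
    for (int i = 0; i < 3 && tail[r / 2][i] != 0; i++)
        out[cnt++] = tail[r / 2][i];

    return cnt;
}
```

Keeping the tail as a table rather than a greedy "largest k first" rule
is what lets 12 come out as 6+6.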

r~