Neon addmul_8
Richard Henderson
rth at twiddle.net
Sun Feb 24 19:13:56 CET 2013
On 2013-02-24 04:24, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
> gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=1694000000
> $ ./t.out
> mpn_addmul_8: 2845ms (1.782 cycles/limb) [973.59 Gb/s]
> mpn_addmul_8: 2620ms (1.641 cycles/limb) [1057.20 Gb/s]
> mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
> mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
> mpn_addmul_8: 2625ms (1.644 cycles/limb) [1055.19 Gb/s]
>
> Wow! You did it!
>
> (Clearly there's something wrong with the bandwidth calculation. ;-)
>
> The "bandwidth" is the number of one-bit multiplies performed. It makes
> m-bit addmul and an n-bit addmul (m=32 and n=64 for example)
> comparable.
Ah, I thought it was some measure of the memory that must have been read
and written in the process.
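
A rough sketch of that "bandwidth" figure. The formula is inferred from the numbers quoted above rather than taken from the benchmark source: it assumes 32-bit limbs and that "cycles/limb" counts one limb-by-limb product. The function name is made up for illustration.

```c
#include <stdio.h>

/* One-bit multiplies per second, in units of 10^9 ("Gb/s").
   A product of two limb_bits-wide limbs is counted as
   limb_bits * limb_bits one-bit multiplies (assumption, see above). */
double bit_multiplies_gbps(double clock_hz, double cycles_per_limb,
                           unsigned limb_bits)
{
    /* Limb-by-limb products completed per second. */
    double limb_products_per_sec = clock_hz / cycles_per_limb;

    return limb_products_per_sec * (double)limb_bits * (double)limb_bits / 1e9;
}
```

With clock_hz = 1694000000, cycles_per_limb = 1.644 and limb_bits = 32 this gives roughly 1055 Gb/s, which agrees with the quoted figure up to rounding of the cycles/limb value. A 64-bit limb counts as 4096 one-bit multiplies, which is what makes the two limb sizes comparable.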
> It is a shame those vext insns are needed. With all those neon addition
> instruction variants, I would have thought vext + vpaddl could be done
> using one insn...
Yeah, but the columns are all off by one. Getting those lined up coming
out of the multiplication would require just as much fiddling.
> One possible weakness of your code is that it separates multiplication
> and addition. Many pipelines prefer a more balanced load.
Yeah, but I use up the entire multiplication portion doing the integer
bookkeeping for the round. I want to get that done asap so that the
load instructions for the next round are issued as early as possible.
And as far as I know, no ARM pipeline does more than dual issue.
> I realise that changing that might be easier said than done. Perhaps
> the last vpaddl could be folded into the beginning of the loop, and
> perhaps the first vext could be folded into the last vmlals?
Possibly...
> What sizes are really interesting for addmul decomposition?
>
> Do you mean k, as in addmul_k? The smaller k we can find, the better.
> Any addmul_k for k > 1 is a compromise, sort of.
Fair enough.
> The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c.
I see. And any amount of adding under the toom limit is reasonable?
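
For readers without the GMP tree at hand, a minimal sketch of the schoolbook structure being discussed: one mul_1 pass for the first multiplier limb, then one addmul_1 pass per remaining limb. This is not GMP's code. The sketch_ names and the limb_t typedef are made up for illustration, and 32-bit limbs are assumed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs */

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
limb_t sketch_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Product plus two limbs is at most 2^64 - 1, so t cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1], for un >= 1 and vn >= 1.
   rp must not overlap either input. */
void sketch_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                         const limb_t *vp, size_t vn)
{
    /* First row initialises the low part of the result. */
    rp[un] = sketch_mul_1(rp, up, un, vp[0]);

    /* Each further row is accumulated one limb higher. */
    for (size_t j = 1; j < vn; j++)
        rp[un + j] = sketch_addmul_1(rp + j, up, un, vp[j]);
}
```

An addmul_k primitive would consume k limbs of vp per pass, so the loop would advance j by k instead of 1 and make fewer passes over rp.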
I've started work on a mul_basecase.asm. I'm calling out to the normal
routines for mul_[12] and addmul_2. I've got internal primitives for
k=14, 8, and 6, which all share some code. This gives:
31: 1+14+14+2
30: 2+14+14
29: 1+14+14
...
14: 2+6+6
13: 1+6+6
12: 2+8+2
11: 1+8+2
10: 2+8
9: 1+8
8: 2+6
7: 1+6
6: 2+2+2
5: 1+2+2
...
It will loop on the 14, so it can still be invoked above the toom limit,
but I also like the table at the end to handle the last 12 so that we
get 6+6 instead of 8+2+2.
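
One possible reading of that scheme, as a sketch: start with mul_1 or mul_2 depending on parity, loop on 14 while at least 14 limbs remain, then look up the even remainder (0 to 12) in a small table. The rule is a guess fitted to the rows listed above. What it produces for the elided sizes is an assumption, not something stated in this message, and for size 11 it yields 1+8+2. The function name and table layout are hypothetical.

```c
#include <stddef.h>

/* Writes the sequence of k values for an n-limb multiplier into out[]
   and returns how many were written. out needs room for n/14 + 3 entries. */
size_t sketch_decompose(unsigned n, unsigned *out)
{
    /* Tail table indexed by remainder/2, for even remainders 0..12.
       A zero entry means "no step". Assumed from the listed rows. */
    static const unsigned tail[7][2] = {
        {0, 0},   /*  0 */
        {2, 0},   /*  2 */
        {2, 2},   /*  4 */
        {6, 0},   /*  6 */
        {8, 0},   /*  8 */
        {8, 2},   /* 10 */
        {6, 6}    /* 12 */
    };
    size_t cnt = 0;

    if (n == 0)
        return 0;

    /* The first step is mul_1 for odd n, mul_2 for even n,
       which leaves an even number of limbs. */
    unsigned first = (n & 1u) ? 1u : 2u;
    out[cnt++] = first;
    n -= first;

    /* Loop on the 14-limb primitive. */
    while (n >= 14) {
        out[cnt++] = 14;
        n -= 14;
    }

    /* Finish the last 0..12 limbs from the table. */
    for (int i = 0; i < 2; i++)
        if (tail[n / 2][i] != 0)
            out[cnt++] = tail[n / 2][i];

    return cnt;
}
```

For n = 31 this writes 1, 14, 14, 2, and for n = 14 it writes 2, 6, 6, matching the corresponding rows above.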
r~