Use of AVX instructions in mpn_mul_1
Thanassis Tsiodras
ttsiodras at gmail.com
Mon Jun 13 23:17:04 CEST 2022
Hi, GMP wizards.
I had a quick look at the x86_64 assembly implementations of the basic
primitive used in multiplications (mpn_mul_1), and saw this:
$ grep mul $(find . -type f | grep asm$ | grep x86_64 | grep /mul_1) | \
grep -v -P '\tmul\tv' | \
grep -v mul.uses | \
grep -v mpn_mul_ | \
grep -v mulx
./x86_64/silvermont/mul_1.asm:include_mpn(`x86_64/bd1/mul_1.asm')
./x86_64/pentium4/mul_1.asm:include_mpn(`x86_64/bd1/mul_1.asm')
./x86_64/goldmont/mul_1.asm:include_mpn(`x86_64/coreisbr/mul_1.asm')
Basically, after grep-ing out:
- instances of "mul v0", "mul v1", etc...
- comments mentioning that "mul" clobbers rdx...
- labels of mpn_mul_1...
- ...and uses of "mulx"...
...I could not find any use of AVX-integer-related multiplication
instructions.
I am talking about things like _mm512_mul_epu32 (the AVX-512 vpmuludq
instruction), which at first glance seemed promising (eight 32x32-bit
multiplications in one instruction, generating eight 64-bit results in
one go).
Then again, the 64-bit outputs of the eight 32x32 multiplications would
have to be added "horizontally" (add/adc), each shifted by 32 bits
relative to its neighbour, with carries propagated between them.
I can't see a way to do that efficiently. Is that the reason the GMP asm
code seems to prefer the plain 64x64 => 128 instruction (mul %rcx)?
Asking as a curious x86-64 guy,
Thanassis.
P.S. Most of the asm files indicate in a comment that: "The loop...
code is the result of running a code generation and optimization tool suite
written by David Harvey and Torbjorn Granlund". Did this tool consider AVX
(and, in general, SIMD) instructions as well?
More information about the gmp-devel mailing list