Use of AVX instructions in mpn_mul_1

Mon Jun 13 23:17:04 CEST 2022

Hi, GMP wizards.

I had a quick look at the x86_64 assembly implementations of the basic
primitive used in multiplications (mpn_mul_1), and saw this:

    $ grep mul $(find . -type f | grep asm$ | grep x86_64 | grep /mul_1) | \
         grep -v -P '\tmul\tv' | \
         grep -v mul.uses | \
         grep -v mpn_mul_ | \
         grep -v mulx
    ./x86_64/silvermont/mul_1.asm:include_mpn(`x86_64/bd1/mul_1.asm')
    ./x86_64/pentium4/mul_1.asm:include_mpn(`x86_64/bd1/mul_1.asm')
    ./x86_64/goldmont/mul_1.asm:include_mpn(`x86_64/coreisbr/mul_1.asm')

Basically, after grep-ing out:

- instances of "mul v0", "mul v1", etc...
- comments mentioning that "mul" clobbers rdx...
- labels of mpn_mul_1...
- ...and uses of "mulx"...

...I could not find any use of AVX integer multiplication instructions.
I am talking about things like "_mm512_mul_epu32" (the intrinsic for
vpmuludq), which at first glance seemed promising (eight 32x32-bit
multiplications in one instruction, generating eight 64-bit results in
one go).

Then again, the 64-bit outputs of the eight 32x32 multiplications
would have to be summed "horizontally" (add/adc), each shifted by 32 bits
relative to its neighbour...

I can't see a way to do that optimally. Is that the reason GMP asm code
seems to prefer the simple 64x64 => 128 instructions?  (mul %rcx)
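
To make the recombination problem concrete, here is a minimal scalar sketch
in C. It is not GMP code: the names ref_mul_1 and split_mul_1 are made up
for illustration, limbs are assumed to be 64-bit (uint64_t), and the
reference version relies on unsigned __int128, a GCC/Clang extension. The
second function uses only 32x32 => 64 products, which is all a vpmuludq
lane provides, and shows the shifts and carry handling each limb then needs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: one 64x64 => 128 multiply per limb, like the scalar "mul".
   Stores the low n limbs of {up,n} * v in rp and returns the top limb. */
uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v + carry;
        rp[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    return carry;
}

/* Same result, but built only from 32x32 => 64 partial products. */
uint64_t split_mul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    assert(n > 0);
    const uint64_t v0 = (uint32_t)v, v1 = v >> 32;
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        const uint64_t u0 = (uint32_t)up[i], u1 = up[i] >> 32;

        /* Four partial products; each one fits in 64 bits. */
        const uint64_t p00 = u0 * v0;   /* weight 2^0  */
        const uint64_t p01 = u0 * v1;   /* weight 2^32 */
        const uint64_t p10 = u1 * v0;   /* weight 2^32 */
        const uint64_t p11 = u1 * v1;   /* weight 2^64 */

        /* The "horizontal" part: sum the 2^32 column (at most 34 bits),
           then propagate its overflow into the high limb. */
        const uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
        uint64_t lo = (mid << 32) | (uint32_t)p00;
        uint64_t hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);

        /* Fold in the carry limb from the previous iteration. */
        lo += carry;
        hi += (lo < carry);

        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

Per limb that is four multiplies plus roughly a dozen shifts, masks and
adds, and the carry still forms a serial chain from one limb to the next.
A SIMD version would have to do the same work across lanes.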

Asking as a curious x86-64 guy,
Thanassis.

P.S. Most of the asm files state in a comment that "The loop...
code is the result of running a code generation and optimization tool suite
written by David Harvey and Torbjorn Granlund". Did this tool suite
consider AVX (and, in general, SIMD) instructions as well?