VPMADD52

Sun Oct 11 23:26:07 UTC 2015

Victor Shoup <shoup at cs.nyu.edu> writes:

  Within the next couple of years, we can expect to see
  a new instruction on Intel chips: VPMADD52.
  This will be a part of the AVX512 ISA, but it's not clear
  when chips with these instructions will actually ship.

  One variant does an 8-way 52-bit x 52-bit -> low 52 bits
  "fused multiply add" on integers. Another does the same,
  but with the high-order 52 bits of the product.
  Obviously, Intel is going to leverage their SIMD FP hardware 
  for this...and one might also infer from this that true 64-bit
  SIMD instructions are nowhere on Intel's roadmap.

  So the question is: what is GMP's roadmap for SIMD development,
  and does it include any plans for VPMADD52?  I've been talking to a
  fellow at Intel about this (Shay Gueron), who is potentially
  interested in contributing code to GMP.  I'm also interested, because
  of potential applications to my NTL library for faster multi-modular
  FFTs.
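
For readers who want the arithmetic of the two variants spelled out, here
is a minimal scalar sketch in C of what one lane computes, as described in
the quoted message.  The function names are hypothetical, the 64-bit
accumulator is an assumption (the message only specifies the 52-bit
product halves), and `unsigned __int128` is a GCC/Clang extension.  This
is a model of the described behaviour, not Intel's definition.

```c
#include <stdint.h>

#define MASK52 ((UINT64_C(1) << 52) - 1)

/* "Low" variant, one lane: add the low 52 bits of the 104-bit product
   of the 52-bit inputs to the accumulator (assumed 64 bits wide). */
static uint64_t madd52lo(uint64_t acc, uint64_t b, uint64_t c)
{
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return acc + ((uint64_t)p & MASK52);
}

/* "High" variant, one lane: add the high 52 bits (bits 52..103)
   of the same product to the accumulator. */
static uint64_t madd52hi(uint64_t acc, uint64_t b, uint64_t c)
{
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return acc + (uint64_t)(p >> 52);
}

/* The 8-way form applies the same operation to 8 independent lanes. */
static void madd52lo_x8(uint64_t acc[8], const uint64_t b[8],
                        const uint64_t c[8])
{
    for (int i = 0; i < 8; i++)
        acc[i] = madd52lo(acc[i], b[i], c[i]);
}
```

Note that getting the full 104-bit product of one pair of inputs takes one
call to each of madd52lo and madd52hi, which is the "two instructions"
cost discussed in the reply.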

The current 64x64 -> 128 bit instruction performs 3 times more work than
one way of these new instructions (1.51 times more because of width, 2
times because it produces the full product with one instruction instead
of two).  So if one could make perfect use of all 8 ways, one could hope
for 2.3 times better performance at the same clock, provided the
instructions have the same throughput (which is probably a reasonable
assumption).

Less excited already?

Making use of SIMD for a single bignum operation is difficult.  One
typically needs to transfer intermediate results at a high rate to the
integer registers for final accumulation.

Due to the divide-and-conquer nature of GMP's multiply algorithms,
operands tend to be fairly small.  They also don't come in multiples of
8 words, except about 1/8 of the time...
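
To put a number on the padding loss alone, here is a small C sketch.  It
assumes a simplistic model in which an n-word operand is processed in
whole vectors of `ways` lanes and unused lanes in the last vector are
wasted; the function name is hypothetical, and real kernels lose more
than this to data movement.

```c
#include <stddef.h>

/* Fraction of SIMD lanes doing useful work when n words are processed
   in whole vectors of `ways` lanes (last vector padded). */
static double lane_utilisation(size_t n, size_t ways)
{
    if (n == 0 || ways == 0)
        return 0.0;
    size_t vectors = (n + ways - 1) / ways;   /* round up */
    return (double)n / (double)(vectors * ways);
}
```

Under this model a 9-word operand on 8 ways fills 9 of 16 lanes, and only
sizes that are exact multiples of 8 reach full utilisation.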

I'd be impressed if one could get close to 50% utilisation of an 8-way
multiply feature such as this.  Now we're at 1.15 times speedup (if we
assume 100% utilisation of plain 64x64->128 mul, which is getting closer
to true with each Intel CPU generation).
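
The estimate can be reproduced with a few lines of C.  This is a sketch
under the stated assumptions only: equal instruction throughput, a work
ratio of (64/52)^2 for operand width, and a factor of 2 for needing
separate low and high instructions.  The function name and parameters are
hypothetical.  Computed this way, the plain ratio comes out near 2.6 for
perfect use of 8 ways and near 1.3 at 50% utilisation, somewhat above the
2.3 and 1.15 quoted above; the quoted figures may reflect rounding or
overheads not spelled out here.

```c
/* Estimated speedup of a `ways`-wide 52-bit multiply-add over one
   64x64->128 multiply per instruction slot, at equal throughput.
   `utilisation` is the fraction of lanes doing useful work (0..1). */
static double madd52_speedup(int ways, double utilisation)
{
    double width_factor = (64.0 / 52.0) * (64.0 / 52.0); /* about 1.51 */
    double split_factor = 2.0;   /* low and high halves need two insns */
    return ways * utilisation / (width_factor * split_factor);
}
```

For example, madd52_speedup(8, 1.0) models the ideal case and
madd52_speedup(8, 0.5) the 50% utilisation case.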

Other important operations, such as 2-adic reductions (aka "Montgomery
multiplication") will be even harder to deal with.

SIMD is hard to use on inherently single data.

  One concrete issue: if one wanted to fully exploit VPMADD52 instructions,
  then perhaps that would be a good reason to enable the "nails" feature
  in GMP.

"Nails" used to work a few years ago, but I expect some bitrot now.  It
would probably take a day or two to make it work again.
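
To illustrate what a 52-bit limb layout would look like, here is a
minimal C sketch that repacks full 64-bit limbs into limbs holding 52
significant bits each (that is, 12 unused "nail" bits per 64-bit word).
The function name and constants are hypothetical and this is not GMP
code; it only shows the representation the nails idea implies.

```c
#include <stddef.h>
#include <stdint.h>

#define NUMB_BITS 52
#define NUMB_MASK ((UINT64_C(1) << NUMB_BITS) - 1)

/* Repack n 64-bit limbs (least significant first) into 52-bit limbs.
   `out` must have room for (64*n + 51) / 52 entries.
   Returns the number of limbs written. */
static size_t repack_to_52(uint64_t *out, const uint64_t *in, size_t n)
{
    size_t nout = (64 * n + NUMB_BITS - 1) / NUMB_BITS;
    for (size_t i = 0; i < nout; i++) {
        size_t bit = i * NUMB_BITS;        /* starting bit position */
        size_t word = bit / 64;
        unsigned shift = (unsigned)(bit % 64);
        uint64_t v = in[word] >> shift;
        /* Fewer than 52 bits left in this word: take the remainder
           from the next word, if there is one. */
        if (shift > 64 - NUMB_BITS && word + 1 < n)
            v |= in[word + 1] << (64 - shift);
        out[i] = v & NUMB_MASK;
    }
    return nout;
}
```

With limbs in this form, each one fits the 52-bit inputs of the new
instructions directly, at the cost of about 23% more limbs per operand.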

-- 
Torbjörn
Please encrypt, key id 0xC8601622