VPMADD52

Sun Oct 11 23:40:15 UTC 2015

Yes, that is a depressing estimate...
The factor of 2 you estimate...that is for latency, but maybe
not for throughput. Not sure about that. Of course, there are no
timing estimates available for these instructions yet, but for
AVX2, it looks like most of these types of instructions have
high latency but can issue one per cycle.
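
To make the latency/throughput distinction concrete, here is a rough
sketch in Python. The 5-cycle latency is an assumed placeholder, not a
figure for VPMADD52 (no timings are available), and the model is
deliberately crude: a fully pipelined unit, one issue per cycle, no
other bottlenecks. The function names are made up for illustration.

```python
def cycles_independent(n, latency, issue_interval=1):
    """Cycles to finish n independent instructions on a pipelined unit."""
    if n == 0:
        return 0
    # The last instruction issues at cycle (n - 1) * issue_interval
    # and completes `latency` cycles later.
    return (n - 1) * issue_interval + latency


def cycles_dependent(n, latency):
    """Cycles for n instructions where each one needs the previous result."""
    return n * latency


LATENCY = 5  # assumed placeholder, not a measured value

for n in (1, 8, 64):
    print(n, cycles_independent(n, LATENCY), cycles_dependent(n, LATENCY))
```

With many independent multiplies in flight, the cost per instruction
approaches one cycle whatever the latency is; a dependent chain pays
the full latency every time.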

In any case, it sounds like this isn't in the cards any time soon.

> On Oct 11, 2015, at 7:26 PM, tg at gmplib.org (Torbjörn Granlund) wrote:
> 
> Victor Shoup <shoup at cs.nyu.edu> writes:
> 
>  Within the next couple of years, we can expect to see
>  a new instruction on Intel chips: VPMADD52.
>  This will be a part of the AVX512 ISA, but it's not clear
>  when chips with these instructions will actually ship.
> 
>  One variant does an 8-way 52-bit x 52-bit -> low 52 bits
>  "fused multiply add" on integers. Another does the same,
>  but with the high-order 52 bits of the product.
>  Obviously, Intel is going to leverage their SIMD FP hardware 
>  for this...and one might also infer from this that true 64-bit
>  SIMD instructions are nowhere on Intel's roadmap.
> 
>  So the question is: what is GMP's roadmap for SIMD development,
>  and does it include any plans for VPMADD52?  I've been talking to a
>  fellow at Intel about this (Shay Gueron), who is potentially interested 
>  in contributing code to GMP.  I'm also interested, because of potential applications 
>  to my NTL library for faster multi-modular FFTs.
> 
> The current 64x64 -> 128 bit instruction performs 3 times more work than
> one way of these new instructions (1.51 times more because of width, 2
> times because it produces the full product with one insn instead of
> two).  So if one could make perfect use of all 8 ways we could hope for
> 2.3 times better performance @ the same clock if the instructions have
> the same throughput (which probably is a reasonable assumption).
> 
> Less excited already?
> 
> Making use of SIMD for a single bignum operation is difficult.  One
> typically needs to transfer intermediate results at a high rate to the
> integer register for final accumulation.
> 
> Due to the divide-and-conquer nature of GMP's multiply algorithms,
> operands tend to be fairly small.  They also don't come in multiples of 8
> words except about 1/8 of the time...
> 
> I'd be impressed if one could get close to 50% utilisation of an 8-way
> multiply feature such as this.  Now we're at 1.15 times speedup (if we
> assume 100% utilisation of plain 64x64->128 mul, which is getting closer
> to true for each Intel CPU generation).
> 
> Other important operations, such as 2-adic reductions (aka "Montgomery
> multiplication") will be even harder to deal with.
> 
> SIMD is hard to use on inherently single data.
> 
>  One concrete issue: if one wanted to fully exploit VPMADD52 instructions,
>  then perhaps that would be a good reason to enable the "nails" feature
>  in GMP.
> 
> "Nails" used to work a few years ago, but I expect some bitrot now.  It
> would probably take a day or two to make it work again.
> 
> 
> -- 
> Torbjörn
> Please encrypt, key id 0xC8601622
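
For reference, a minimal Python sketch of what one lane of the two
VPMADD52 variants computes, and of how 52-bit limbs held in 64-bit
words (which is roughly what a 12-bit nail configuration would give)
fit together in a schoolbook product. The 64-bit accumulator width
follows Intel's published description of the instructions; the function
names are hypothetical, and this models the arithmetic only, not the
8-way SIMD execution or its timing.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """One lane: acc + low 52 bits of a*b, wrapping at 64 bits."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """One lane: acc + high 52 bits of the 104-bit a*b, wrapping at 64 bits."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod >> 52)) & MASK64


def to_limbs(x, n):
    """Split a non-negative integer into n limbs of 52 bits, low limb first."""
    return [(x >> (52 * i)) & MASK52 for i in range(n)]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (52 * i) for i, limb in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product of two limb vectors using only the two madd52 ops.

    Each column collects terms below 2**52 in a 64-bit accumulator, so the
    12 spare bits absorb carries as long as the operands are short enough
    (true for the small sizes used here).
    """
    acc = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            acc[i + j] = madd52lo(acc[i + j], ai, bj)
            acc[i + j + 1] = madd52hi(acc[i + j + 1], ai, bj)
    # Final carry propagation back to normalised 52-bit limbs.
    out, carry = [], 0
    for v in acc:
        v += carry
        out.append(v & MASK52)
        carry = v >> 52
    return out


x = 0x123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
y = 0xFEDCBA9876543210FEDCBA9876543210FEDCBA987654321
assert from_limbs(mul_limbs(to_limbs(x, 4), to_limbs(y, 4))) == x * y
```

The final carry loop is the scalar step in this sketch; the
accumulation itself is the part that would map onto the SIMD lanes.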