Leaky multiply instruction on Cortex-A75

Tue Dec 18 07:30:00 UTC 2018

tg at gmplib.org (Torbjörn Granlund) writes:

> Ref: https://static.docs.arm.com/101398/0200/arm_cortex_a75_software_optimization_guide_v2.pdf
> Cf pages 15-16.
>
> I haven't seen a leaky multiply instruction on a mainstream CPU since
> the days of POWER3, i.e., in 20 years.

If I read those footnotes correctly, the only multiply
instruction with that data dependency is "Multiply accumulate X-form",
footnote 10, page 15. Maybe we can just avoid that instruction?

But I don't quite get the numbers (and I'm not familiar with the
instruction set, so I'm not really sure which of the instructions are
relevant for GMP loops). UMADDL looks reasonable (3 cycles latency for
the factors, but only 1 for the addend). But UMULH looks much slower,
so maybe I'm missing something?
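
For reference, here is a rough C sketch of the kind of loop I have in
mind, in the spirit of mpn_addmul_1. It is not GMP's actual code: the
name addmul_1_sketch and the use of the unsigned __int128 extension
(GCC and Clang) are assumptions for illustration only. As far as I
understand, compilers turn the 128-bit product into a MUL for the low
half and a UMULH for the high half on AArch64, which is why the latency
of the high-half multiply matters here.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only, not GMP code: {rp,n} += {up,n} * v.
   Returns the carry-out limb.  Limbs are 64 bits, so B = 2^64. */
uint64_t
addmul_1_sketch (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* up[i]*v + rp[i] + cy <= (B-1)^2 + 2(B-1) = B^2 - 1,
         so the sum always fits in 128 bits.  */
      unsigned __int128 t = (unsigned __int128) up[i] * v + rp[i] + cy;
      rp[i] = (uint64_t) t;         /* low limb of the sum */
      cy = (uint64_t) (t >> 64);    /* high limb feeds the next iteration */
    }
  return cy;
}
```

The carry cy is the recurrence in this loop, so what matters is how
quickly the high half becomes available once the addends are known.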

Does any arch have a multiply-accumulate-high instruction, producing
floor((a*b+c)/B)? That would be very useful if the latency for the c
input is small. It would be a reasonably cheap extension if the hardware
is based on a Dadda multiplier (+ pipelining flip-flops at suitable
places). See https://en.wikipedia.org/wiki/Dadda_multiplier
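
To be concrete about the semantics, here is a minimal C model of the
operation, assuming 64-bit words (B = 2^64). The name umaddh is made up
for this sketch and does not refer to any existing instruction or GMP
function, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

/* Hypothetical multiply-accumulate-high: floor((a*b + c) / B), B = 2^64.
   Models the wished-for instruction; no such mnemonic is implied.  */
static inline uint64_t
umaddh (uint64_t a, uint64_t b, uint64_t c)
{
  /* a*b + c <= (B-1)^2 + (B-1) = B*(B-1) < B^2, so no 128-bit overflow.  */
  unsigned __int128 t = (unsigned __int128) a * b + c;
  return (uint64_t) (t >> 64);
}
```

Note that the result can differ from the plain high product by at most
one, namely when adding c carries out of the low word.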

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.