Surprise performance from Apple M1

Sat Nov 21 18:50:27 UTC 2020

The GMP project got a low-end Apple Mac Mini M1 in order to make sure
GMP works for arm-macos systems.

We had a major surprise from the GMP performance of these CPUs!

No other CPU runs GMP this well.  Almost every inner loop runs at < 1
cycle/limb.  That includes mpn_mul_1, but not the most important loop,
mpn_addmul_1.  And that is before any attempt at optimising things for
the M1.

The 3.2 GHz M1 in our system takes the #2 spot in the GMPbench top-list.
The #1 spot is an AMD Ryzen, but that runs at 4.4 GHz.

Getting mpn_addmul_1 to run closer to 1 cycle/limb would mean a lot for
GMP's performance.  There is an architectural shortcoming which might
make it tricky, though: there is just one carry/borrow flag, unlike
x86's two (as used by adcx/adox), and there is no instruction for
highword(a*b+c).  As a result, addmul_1, which needs a 3-way add for
its product accumulation, needs to add some words, save carry, restore
carry, add to the same words again, save carry, restore carry, etc.
That's quite expensive.
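
To make the 3-way add concrete, here is a minimal sketch in portable C
of what an addmul_1 loop computes.  It is not GMP's code: the names
ref_addmul_1 and limb_t are made up for illustration, 64-bit limbs are
assumed, and unsigned __int128 is a GCC/Clang extension.  Each limb
needs two separate additions into the low product word, and each one
produces its own carry that has to be folded into the high word.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical name; assumes 64-bit limbs */

/* Illustrative reference loop, not GMP's implementation:
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128 bit product, split into low and high words. */
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);

        /* First add: the carry limb from the previous iteration. */
        lo += carry;
        hi += (lo < carry);   /* carry-out of the first add */

        /* Second add into the same word: the destination limb. */
        lo += rp[i];
        hi += (lo < rp[i]);   /* carry-out of the second add */

        rp[i] = lo;
        carry = hi;           /* cannot overflow: a*b + c + d < 2^128 */
    }
    return carry;
}
```

In assembly, the two carry-outs per limb are what compete for the
single carry flag; with two independent flags the two add chains can
run side by side.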

x86 used to have that same problem.  The addition of adox/adcx greatly
helped GMP.  IBM's Power used to have the same problem, and they added
both highword(a*b+c) *and* multiple carry flags.
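
For reference, the highword(a*b+c) operation can be written in C as
below.  This is only a sketch of the semantics, assuming 64-bit words;
the function name mul_add_hi is made up here, and unsigned __int128 is
a GCC/Clang extension.

```c
#include <stdint.h>

/* Hypothetical helper: the high 64 bits of the 128-bit value a*b + c.
   The sum cannot overflow 128 bits, since
   (2^64-1)^2 + (2^64-1) = 2^128 - 2^64. */
static inline uint64_t mul_add_hi(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 t = (unsigned __int128)a * b + c;
    return (uint64_t)(t >> 64);
}
```

With such an operation, the carry from adding c into the low product
word is absorbed into the high word directly, so it never has to pass
through a carry flag.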

-- 
Torbjörn
Please encrypt, key id 0xC8601622