How to calculate cycles/limb in assembly routines
Albin Ahlbäck
albin.ahlback at gmail.com
Fri Apr 5 01:51:42 CEST 2024
Thanks for the fast and helpful reply!
I see, I definitely need to read up on the CPU pipelines. I also tested
one of your automated scripts for measuring cycles per limb for a
variety of functions, and it checks out.
Anyway, with regard to the performance of multiplication: I did manage to
write some half-hardcoded routines that outperform mpn_mul_basecase quite
a bit on Apple M1 (only tested on the Mac Mini on cfarm). They are
basically of the form
mpn_mul_N(mp_ptr, mp_srcptr, mp_size_t, mp_srcptr)
for N in 1, 2, ..., 15. I recall that this translated very well into the
Toom-Cook territory (when using these, the cutoff between Toom22 using
these underlying algorithms and GMP's Toom33 is at ~480 limbs, pretty
impressive!(?)). For instance, with N = 8 it is 80% faster
asymptotically than mpn_mul_basecase on M1. They do, however, span a lot
of code, as each case has to be hand-coded, so I suppose they would not
fit into GMP.
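To illustrate what "half-hardcoded" could mean here, below is a rough
portable-C sketch for N = 2, where one operand has a fixed length of two
limbs and the other has variable length. It is not the actual routine
(those are hand-coded per N). The name mul_2, the limb_t typedef, the
meaning assigned to each argument (result, variable-length operand, its
length, fixed N-limb operand) and the reliance on unsigned __int128 (a
GCC/Clang extension) are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for GMP's mp_limb_t; assumes 64-bit limbs. */
typedef uint64_t limb_t;

/* Hypothetical mul_2: writes the (an + 2)-limb product
   {ap, an} * {bp, 2} to rp. Assumes an >= 1 and that rp does not
   overlap the inputs. The argument order mirrors the mpn_mul_N form
   above; the argument meanings are an assumption. */
static void mul_2(limb_t *rp, const limb_t *ap, size_t an, const limb_t *bp)
{
    /* The fixed-length operand is loaded once and stays in registers. */
    const limb_t b0 = bp[0], b1 = bp[1];
    /* Two-limb running carry: c0 has the weight of rp[i], c1 of rp[i+1]. */
    limb_t c0 = 0, c1 = 0;

    for (size_t i = 0; i < an; i++) {
        /* Neither sum can overflow 128 bits: (B-1)^2 + 2(B-1) = B^2 - 1. */
        unsigned __int128 lo = (unsigned __int128) ap[i] * b0 + c0;
        unsigned __int128 hi = (unsigned __int128) ap[i] * b1 + c1
                               + (limb_t) (lo >> 64);
        rp[i] = (limb_t) lo;
        c0 = (limb_t) hi;
        c1 = (limb_t) (hi >> 64);
    }
    rp[an] = c0;
    rp[an + 1] = c1;
}
```

With N fixed, the inner work per limb of the long operand is a known
number of multiplies and the partial sums need not be stored and reloaded
between rows, which is presumably where the gain over a generic basecase
loop comes from; the cost is one such routine per N.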
Anyway, thanks for your reply!
Best,
Albin
More information about the gmp-devel mailing list