How to calculate cycles/limb in assembly routines

Albin Ahlbäck albin.ahlback at gmail.com
Fri Apr 5 01:51:42 CEST 2024


Thanks for the fast and helpful reply!

I see, I definitely need to read up on CPU pipelines. I also tested 
one of your automated scripts for measuring cycles per limb for a 
variety of functions, and it checks out.

Anyway, regarding the performance of multiplication: I did manage to 
write some half-hardcoded routines that outperform mpn_mul_basecase 
quite a bit on Apple M1 (only tested on the Mac Mini on cfarm). They 
are basically of the form

	mpn_mul_N(mp_ptr, mp_srcptr, mp_size_t, mp_srcptr)

for N in 1, 2, ..., 15. I recall that this carried over very well into 
the Toom-Cook range (when using these, the cutoff between Toom22 built 
on these underlying routines and GMP's Toom33 is at ~480 limbs, pretty 
impressive!(?)). For instance, with N = 8 it is 80% faster 
asymptotically than mpn_mul_basecase on M1. They do, however, take up 
a lot of code, as each case has to be handcoded, so I suppose they 
would not fit into GMP.
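
To make the shape of these routines concrete, here is a rough sketch in 
plain C of what the N = 2 case could look like. This is only an 
illustration, not the actual code: the real routines are assembly, the 
name mul_2_sketch is made up, and I am assuming 64-bit limbs, a compiler 
with unsigned __int128 (GCC or Clang), and that the arguments mean 
"write the n + 2 limb product of {ap, n} and {bp, 2} to rp".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension (assumption) */

/* Hypothetical sketch: {rp, n + 2} = {ap, n} * {bp, 2}.
   rp must not overlap ap or bp. */
static void
mul_2_sketch (uint64_t *rp, const uint64_t *ap, size_t n, const uint64_t *bp)
{
  assert (rp != NULL && ap != NULL && bp != NULL);

  /* The fixed-size operand is loaded once and stays in registers. */
  uint64_t b0 = bp[0], b1 = bp[1];

  /* Two-limb carry window holding the not-yet-stored high part. */
  uint64_t t0 = 0, t1 = 0;

  for (size_t i = 0; i < n; i++)
    {
      uint64_t a = ap[i];

      /* Each sum is at most 2^128 - 1, so it cannot overflow. */
      u128 lo = (u128) a * b0 + t0;
      u128 hi = (u128) a * b1 + t1 + (uint64_t) (lo >> 64);

      rp[i] = (uint64_t) lo;      /* limb i is now final */
      t0 = (uint64_t) hi;         /* slide the window up one limb */
      t1 = (uint64_t) (hi >> 64);
    }

  rp[n] = t0;
  rp[n + 1] = t1;
}
```

With N fixed, the inner work per limb of ap is a known number of 
multiplications with no inner loop, which is the part that gets 
hardcoded separately for each N.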

Anyway, thanks for your reply!

Best,
Albin

