How to calculate cycles/limb in assembly routines

Albin Ahlbäck albin.ahlback at
Fri Apr 5 16:20:20 CEST 2024

Thanks for the further explanation, Niels!

 > For an assembly loop, one can find out from properties of the
 > processor what cycle counts are implied by these three limits. It's
 > often possible (but tedious) to tweak scheduling to get an actual
 > speed pretty close to the limit. And it aids optimization to
 > understand which one is the performance bottleneck.


 > I would expect the speed of such a hard-coded function to be limited
 > by multiplier throughput (O(N^2)); it should be possible to arrange
 > the order you add up the N^2 terms so that your carry chain
 > corresponds to the size of the product (O(N)).

Yeah, sorry, my benchmark was wrong, so it is only ~20% faster
asymptotically. Sorry for the noise.


More information about the gmp-devel mailing list