How to calculate cycles/limb in assembly routines
Albin Ahlbäck
albin.ahlback at gmail.com
Fri Apr 5 16:20:20 CEST 2024
Thanks for the further explanation, Niels!
> For an assembly loop, one can find out from properties of the
> processor what cycle counts are implied by these three limits. It's
> often possible (but tedious) to tweak scheduling to get an actual
> speed pretty close to the limit. And it aids optimization to
> understand which one is the performance bottleneck.
[snip]
> I would expect the speed of such a hard-coded function to be limited
> by multiplier throughput (O(N^2)); it should be possible to arrange
> the order you add up the N^2 terms so that your carry chain
> corresponds to the size of the product (O(N)).
Yeah, sorry, my benchmark was wrong, so it is only ~20% faster
asymptotically. Sorry for the noise.
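For reference, the column-wise ordering suggested above could be sketched
roughly as follows in C. This is only an illustration under assumptions: the
name mul_columnwise, the 32-bit limb type and the two-word accumulator are
hypothetical and are not taken from GMP or from the hard-coded routine under
discussion. All N^2 limb products are still computed, but they are summed one
result column at a time, so a carry is passed on only once per column and the
carry chain has length 2N-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GMP code.
   Computes rp[0..2n-1] = ap[0..n-1] * bp[0..n-1] with 32-bit limbs,
   least significant limb first, summing the products column by column
   (product scanning). Requires n >= 1 and rp not overlapping ap or bp. */
void mul_columnwise(uint32_t *rp, const uint32_t *ap, const uint32_t *bp,
                    size_t n)
{
    uint64_t acc_lo = 0; /* low 64 bits of the column accumulator */
    uint32_t acc_hi = 0; /* counts carries out of acc_lo */

    assert(n >= 1);

    for (size_t k = 0; k + 1 < 2 * n; k++) {
        /* Column k collects every product ap[i] * bp[j] with i + j == k. */
        size_t i_min = (k < n) ? 0 : k - n + 1;
        size_t i_max = (k < n) ? k : n - 1;

        for (size_t i = i_min; i <= i_max; i++) {
            uint64_t p = (uint64_t) ap[i] * bp[k - i];
            acc_lo += p;
            acc_hi += (acc_lo < p); /* carry out of the 64-bit addition */
        }

        /* Emit one result limb, then shift the accumulator down one limb.
           This is the only carry step per column. */
        rp[k] = (uint32_t) acc_lo;
        acc_lo = (acc_lo >> 32) | ((uint64_t) acc_hi << 32);
        acc_hi = 0;
    }

    /* What remains after the last column is the top limb of the product. */
    rp[2 * n - 1] = (uint32_t) acc_lo;
}
```

In an assembly version the inner loop would presumably be fully unrolled for
a fixed N, leaving the multiplier as the main limit, but that is an
expectation rather than something measured here.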
Best,
Albin