How to calculate cycles/limb in assembly routines
Torbjörn Granlund
tg at gmplib.org
Thu Apr 4 22:19:34 CEST 2024
Albin Ahlbäck <albin.ahlback at gmail.com> writes:
I am looking at Torbjörn's `aorsmul_1.asm' for Apple M1, and I am
having trouble understanding how the cycles per limb number was
calculated.
It was not calculated. It was measured.
But that asm code was written with understanding of the M1 pipeline.
Here, 1.25 c/l was actually what was expected.
As I understand it, the cycles per limb number represents the loop(s)
in any routine. Looking at the main loop, it seems like it should
scale at 10 cycles per loop (of which 2 cycles are lost due to latency
from loading x4, I believe), for which it treats four limbs from `up'
at a time. However, the given number is 1.25 which is half the size of
my calculated 10 / 4.
I cannot follow your reasoning here, but as it seems incorrect from
measurements, I am not too worried. :-)
The performance limit to 1.25 c/l (5 cycles per loop) comes from these
lines:
ADDSUB x8, x8, CY C W0 carry-in
ADDSUBC x4, x4, x9 C W1
ADDSUBC x5, x5, x10 C W2
ADDSUBC x6, x6, x11 C W2
csinc CY, x7, x7, COND C W3 carry-out
These 5 instructions form a circular dependency, and each instruction
has a one cycle latency, so 5 cycles in total.
There are other parts of the loop with also sets performance limits,
e.g., the mul/umulh has a throughput of 2 per cycle, so even with the
above circular dependency removed, the loop would need 4 cycles.
I hope this reply helps. Modern CPUs have complex pipelines, with lots
of buffering to cate for dependencies and very wide exeecution in the
absence of dependencies.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list