How to calculate cycles/limb in assembly routines

Thu Apr 4 22:19:34 CEST 2024

Albin Ahlbäck <albin.ahlback at gmail.com> writes:

  I am looking at Torbjörn's `aorsmul_1.asm' for Apple M1, and I am
  having trouble understanding how the cycles per limb number was
  calculated.

It was not calculated.  It was measured.

But that asm code was written with understanding of the M1 pipeline.
Here, 1.25 c/l was actually what was expected.

  As I understand it, the cycles per limb number represents the loop(s)
  in any routine. Looking at the main loop, it seems like it should
  scale at 10 cycles per loop (of which 2 cycles are lost due to latency
  from loading x4, I believe), for which it treats four limbs from `up'
  at a time. However, the given number is 1.25 which is half the size of
  my calculated 10 / 4.

I cannot follow your reasoning here, but as it seems incorrect from
measurements, I am not too worried.  :-)

The performance limit to 1.25 c/l (5 cycles per loop) comes from these
lines:

	ADDSUB	x8, x8, CY		C W0	carry-in
	ADDSUBC	x4, x4, x9		C W1
	ADDSUBC	x5, x5, x10		C W2
	ADDSUBC	x6, x6, x11		C W2
	csinc	CY, x7, x7, COND	C W3	carry-out

These 5 instructions form a circular dependency, and each instruction
has a one cycle latency, so 5 cycles in total.

There are other parts of the loop with also sets performance limits,
e.g., the mul/umulh has a throughput of 2 per cycle, so even with the
above circular dependency removed, the loop would need 4 cycles.

I hope this reply helps.  Modern CPUs have complex pipelines, with lots
of buffering to cate for dependencies and very wide exeecution in the
absence of dependencies.

-- 
Torbjörn
Please encrypt, key id 0xC8601622