How to calculate cycles/limb in assembly routines

Niels Möller nisse at lysator.liu.se
Fri Apr 5 08:40:07 CEST 2024


Albin Ahlbäck <albin.ahlback at gmail.com> writes:

> I see, I definitely need to read up on the CPU pipelines. I also
> tested one of your automated scripts for measuring cycles per limbs
> for a variety of functions, and it checks out.

As I understand it, there are three main things that limit the
performance of the kinds of assembly loops we have in gmp (most
operating on small well cached data):

1. Recurrency latency, like the carry dependency Torbjörn explained.
   Basically, the sum of latencies for the chain of instructions
   representing how carry out depends on carry in.

2. Throughput of particular execution units, e.g, how many multiply
   instructions that can be started per cycle (to be completed some
   cycles later, depending on the latency of the multiplication
   pipeline).

3. Instruction issue, the processor's limit on how many instructions can
   be decoded and started each cycle. E.g., if processor can start 3
   instructions per cycle in parallel, and your loop is 15 instructions,
   it's not going to be faster than 5 cycles.

For an assembly loop, one can find out from properties of the processor
what cycle counts are implied by these three limits. It's often possible
(but tedious) to tweak scheduling to get an actual speed pretty close to
the limit. And it aids optimization to understand which one is the
performance bottleneck.

> Anyway, in regards to the performance of multiplication: I did manage
> to write some half-hardcoded that outperforms the mpn_mul_basecase
> quite a bit on Apple M1 (only tested on the Mac Mini on cfarm). They
> are basically on the form
>
> 	mpn_mul_N(mp_ptr, mp_srcptr, mp_size_t, mp_srcptr)
>
> for N in 1, 2, ..., 15. 

I would expect the speed of such a hard-coded function to be limited by
multiplier throughput (O(N^2)); it should be possible to arrange the
order you add up the N^2 terms so that your carry chain corresponds to
the size of the product (O(N)).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


More information about the gmp-devel mailing list