Some basic questions on the invert_limb code

Anil Singhar anil.singhar at
Wed Nov 20 06:13:06 UTC 2013

Hi Torbjorn,

Sorry, maybe I didn't understand you fully. Which repository are you
referring to? The GMP 5.1.3 code I used as my base was downloaded
from the official GMP website. I don't know if there is any other

Yes, I agree with you that we need to wait until we have hardware or
cycle-accurate simulation. But my goal this time was to have something with
aarch64 support for all functions, which builds and which is functionally
correct (but not necessarily optimized), so that I can focus on
optimizations once the hardware or cycle-accurate simulation is available.
If your team is planning to do this in the future, I am OK with that. I am
available to do this as well, but will need some help in understanding the
existing code, algorithms, scope, etc.

We can revisit this activity after hardware is available, which will
probably be sometime next year. I hope that is OK with you. Let me know your

Thanks and Regards,

On 19 November 2013 19:07, Torbjorn Granlund <tg at> wrote:

> Anil Singhar <anil.singhar at> writes:
>   Yes, I started off with 5.1.3 code and have already spent more than a
>   month now coding the MPN functions in aarch64 assembly along the lines
>   of ARM. I did so assuming the advisory mentioned in the gmp manual that
>   these functions are generally implemented in assembly for speed. Sorry,
>   I didn't understand the full scope of your first reply to my query in
>   October. I got stuck with invert_limb since it was a bit non-trivial to
>   implement and hence decided to ask.
>
> You should compare your code to the code we already have.  (Did I
> mention the GMP code repository...?)
>
> Perhaps you have implemented something better than us, or have a more
> complete set of functions.
>
> We only implemented a basic set, where speedup for real hardware seems
> likely, no matter how its pipeline works.
>
> The most critical operation for any CPU-specific GMP optimisation
> project is limb product accumulation.  Since A57 presumably can only
> form a 64 x 64 -> 128 bit product every 7th cycle, at least when using
> "core register" operations, one needs to look into alternative, SIMD
> formulations.  Such code is quite tricky, but should give at least a
> 2-fold speedup on A57.  I don't think such code should be written
> without hardware (or cycle accurate simulation).
> --
> Torbjörn
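
For readers following the thread, the two operations it discusses (limb
product accumulation and invert_limb) can be pinned down with portable C
reference versions. This is only a sketch under stated assumptions, not
GMP's code: the names ref_addmul_1 and ref_invert_limb and the types
limb_t and dlimb_t are hypothetical, limbs are assumed to be 64 bits, the
compiler is assumed to provide unsigned __int128 (GCC or Clang), and
invert_limb is assumed to mean floor((B^2 - 1) / d) - B with B = 2^64 for
a divisor d whose top bit is set.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* double limb; GCC/Clang extension */

/* Limb product accumulation: rp[0..n-1] += up[0..n-1] * v.
   Returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One 64 x 64 -> 128 bit product per limb. The sum is at most
           (B-1)^2 + 2(B-1) = B^2 - 1, so it fits in 128 bits. */
        dlimb_t t = (dlimb_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;          /* low limb of the result */
        carry = (limb_t)(t >> 64);  /* high limb carries into the next step */
    }
    return carry;
}

/* Assumed definition: floor((B^2 - 1) / d) - B, with d >= B/2. */
static limb_t ref_invert_limb(limb_t d)
{
    dlimb_t all_ones = ~(dlimb_t)0;  /* B^2 - 1 */
    /* For d >= B/2 the quotient lies in [B+1, 2B-1], so keeping only
       the low 64 bits is the same as subtracting B. */
    return (limb_t)(all_ones / d);
}
```

Versions like these are slow, but they give a functional baseline against
which hand-written assembly can be checked before any hardware timing is
available.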
