Adrien Prost-Boucle adrien.prost-boucle at
Mon Dec 19 17:21:30 UTC 2016

Hi Torbjörn,

I updated my code since my last message to the mailing list.
Now there is a small archive on my server, with the Makefile:
If you find some interest in my code, I can create a github/gitlab repo.

Basically my code will compile only when the macro MACHINE_X86_64 is set,
because that's my system and I was mostly interested in that.
The Makefile sets it automatically.
Also, GMP is a dependency, I added it because while doing debug of the double-word version I needed a good reference ;-) 

My fallback functions are not complete nor very efficient I fear, in particular smul64 is absent.
I implemented these fallback functions mostly as toy stuff because I have only very small interest in it.

My 32-bit version may still be "benchmarkable" on 64-bit systems/machines because it uses only 32-bit words.
(similar to GMP 32-bit implementation)
My benchmark function compares my 32-bit code against GMP 32-bit code, compiled with the very same options.
Although it's not identical to actually being on a 32-bit machine, this should give a good idea about potential speedup.

That said, the interesting part in my code is these functions:
- sqrt32_inv()    for single 32-bit words
- sqrt64_inv()    for single 64-bit words
- sqrt64x2_inv()  for double 64-bit words

All other algorithms in the code are also toy stuff,
but interesting from an implementation point of view, to understand which part of the algorithms cost time.

Note that I copied code from GMP to compare:
- sqrt32_mpn()    for single 32-bit words
- sqrt64_mpn()    for single 64-bit words
- sqrt64x2_mpn()  for double 64-bit words

The program enables to launch benchmark with these commands:
./sqrt bench32 10000000
./sqrt bench64 10000000
./sqrt bench64x2 10000000

It should compile without problem if you're on a 64-bit machine, at least with gcc.
Don't hesitate to ask in case you find some problems.



By the way about fallback functions (when there is no asm):
I noted that GMP fallback function umul_ppmm(), in longlong.h in GMP code,
uses 4 multiplications where the Karatsuba method would only requires 3,
I was wondering whether optimization was possible...

On Mon, 2016-12-19 at 17:07 +0100, Torbjörn Granlund wrote:
> Adrien!
> We need fast code for computing the square root of 32-bit and 64-bit
> numbers on 32-bit machines.
> We need fast code for computing the square root of 64-bit and 128-bit
> numbers on 64-bit machines.
> I.e., we need single-word and and double-word square root in both cases.
> Now, I am a bit confused about what you've implemented, and also how to
> compile your code.
> When trying a plain compile of your code, I get complaints about
> smul64x2 being undefined.  I spotted the MACHINE_X86_64 macro, which
> allows me to compile things.
> Benchmarking the 32-bit code for a 64-bit machine is not very
> interesting, at least not if that means we take advantage of 64-bit
> arithmetic.
> Please give me some guidance.  I'd like to start with a performance
> evaluation.

More information about the gmp-devel mailing list