Adrien Prost-Boucle adrien.prost-boucle at
Sat Jan 28 16:49:38 UTC 2017

Hi Marco,

Thank you for the tests,
and for the example of benchmarking with GMP.

We must be adamant about the correctness issue of course.
Recently I've been contacted separately by Bradley Lucier,
who has discovered an error (-1) for a particular input of 64x2 version...
So yes, correctness issue.

Bradley Lucier also proposed to try a completely different approach:
simply use floating-point functions sqrtf(), sqrt(), sqrtl().

Indeed recent and not-so-recent CPUs have floating-point sqrt instructions.
Also, if I understand correctly, IEEE standard imposes that the sqrt result
is nondecreasing as the input value increases.
This makes it possible to ensure correctness.

Basically, the algorithm is:
- convert the input value to floating-point (float, double or long double)
- call sqrtf(), sqrt() or sqrtl()
- convert back to integer
- apply correction -1 or +1 if necessary

At the end of this message is my code for 32b, 64b and 2x64b.
By default I check only for correction -1 because it seems only this is necessary,
due to the nondecreasing property of the sqrt instructions.

Here is is quick benchmark on my Core2 Duo mobile CPU:
(speedup is around 2x compared to mpn)

./sqrt -noslow bench32   10000000
null ....... 0.0135976 s
mpn ........ 0.134436 s
inv ........ 0.129447 s
fp-f ....... 0.0614301 s
fp-d ....... 0.0955542 s

./sqrt -noslow bench64   10000000
null ....... 0.0113118 s
mpn ........ 0.238126 s
inv ........ 0.225985 s
fp-d ....... 0.115913 s
fp-ld ...... 0.117514 s

./sqrt -noslow bench64x2 10000000
null ....... 0.0123146 s
mpn ........ 0.709794 s
inv ........ 0.606813 s
fp-ld ...... 0.223781 s

Here is is quick benchmark on a higher-end i7 CPU:
(speedup is at least 3x compared to mpn)

./sqrt -noslow bench32   10000000
null ....... 0.00276399 s
mpn ........ 0.0957994 s
inv ........ 0.0418732 s
fp-f ....... 0.0194476 s
fp-d ....... 0.0376165 s

./sqrt -noslow bench64   10000000
null ....... 0.00264621 s
mpn ........ 0.137777 s
inv ........ 0.0551834 s
fp-d ....... 0.0389557 s
fp-ld ...... 0.0494387 s

./sqrt -noslow bench64x2 10000000
null ....... 0.00257349 s
mpn ........ 0.181686 s
inv ........ 0.191645 s
fp-ld ...... 0.0610073 s

When compiled with gcc with some optim, the calls to sqrt() are not even present in the compiled program.
It is "inlined" by the appropriate CPU sqrt instructions.
So it's very efficient.

However, floating-point stuff is probably not a good solutions for embedded systems,
who may not have FP CPU instruction set.
ARM machines without the NEON extension, in particular.
So the current fully interger-base code should be kept.

I have updated my standalone test program with these algorithms:
make bench

About my "inv" algorithm:
I fortuitously discovered that, on some systems and with some compilers,
the generated code is much slower than "mpn" (Debian with gcc 5.4).
This is about using int32_t types. Using int_fast32_t instead solved the issue.

For a quick glance at my code, here it is.
Bradley Lucier noted that right shift of negative variables is undefined.
There have already been some discussion in the GMP mailing lists about that,
so I'll just let you guys decide.
I compile with -fno-math-errno to tell gcc that sqrt of a negative number just returns a NaN without setting ERRNO.
I've not investigated impact on speed.

// Uncomment this to enable check for error correction +1
//#define FLOAT_CORR_P1

uint32_t sqrt32_float(uint32_t val) {
	uint32_t root = (uint32_t)sqrtf((float) val);
	int32_t rem = val - root * root;
	int32_t corr = rem >> 31;
	#ifdef FLOAT_CORR_P1
	rem = 2*root - rem;
	corr += ((uint32_t)rem) >> 31;
	//printf("%"PRIi32" ", corr);
	root += corr;
	return root;
uint32_t sqrt32_float_double(uint32_t val) {
	return (uint32_t)sqrt((double) val);

uint64_t sqrt64_float_double(uint64_t val) {
	uint64_t root = (uint64_t)sqrt((double) val);
	int64_t rem = val - root * root;
	int64_t corr = rem >> 63;
	#ifdef FLOAT_CORR_P1
	rem = 2*root - rem;
	corr += ((uint64_t)rem) >> 63;
	//printf("%"PRIi64" ", corr);
	root += corr;
	return root;
uint64_t sqrt64_float_longdouble(uint64_t val) {
	return (uint64_t)sqrtl((long double) val);

#define CST2POW64 (65536. * 65536. * 65536. * 65536.)

uint64_t sqrt64x2_float(uint64_t valh, uint64_t vall) {
	long double ld = valh;
	ld = (ld * CST2POW64) + vall;
	uint64_t root = sqrtl(ld);

	uint64_t remh, reml;
	int64_t  corr;

	// Calculate r^2
	umul64x2(&remh, &reml, root, root);
	// Calculate val-r^2
	usub64x2(&remh, &reml, valh, vall, remh, reml);
	// Get the correction -1 if the root is too high
	corr = ((int64_t)remh) >> 63;

	#ifdef FLOAT_CORR_P1
	// Subtract to 2*r to check if the root is too low
	usub64x2(&remh, &reml, root >> 63, root << 1, remh, reml);
	// Get the full correction
	corr += remh >> 63;

	// Apply the error correction
	//printf("%"PRIi64" ", corr);
	root += corr;

	return root;


On Sat, 2017-01-28 at 14:12 +0100, Marco Bodrato wrote:
> Ciao,
> Il Ven, 23 Dicembre 2016 5:22 pm, Adrien Prost-Boucle ha scritto:
> > I now have the same API for my implementation and for the code I've taken
> > from GMP.
> As a contribution to this discussion, I did the same thing, but the other
> direction.
> The attached code contains a copy of mpn_sqrtrem{1,2} from our repository,
> a copy of the sqrtrem2 code that Paul Zimmermann posted to this list (July
> 2016), and an adaptation of Adrien Prost-Boucle's sqrtrem2 to the API and
> sintax currently used by this internal function of GMP.
> The attached file is meant as a replacement for the tune/speed-ext.c file
> in a GMP source tree. After replacement:
> ./configure ABI=64 && make && (cd tune;make speed-ext)
> And the tune/speed-ext program will be ready to compare the speed of the
> three implementations.
> Two limitations: it works only with ABI=64, some measurements seem
> sensible to the "precision" required for measures... this is unexpected to
> me.
> > [bodrato at shell ~/gmp-repo]$ tune/speed-ext -s2 -cp1000000000 mpn_sqrtrem2
> pz_sqrtrem2 apb_sqrtrem2 mpn_sqrtrem
> overhead 5.84 cycles, precision 1000000000 units of 2.86e-10 secs, CPU
> freq 3500.07 MHz
>          mpn_sqrtrem2   pz_sqrtrem2  apb_sqrtrem2   mpn_sqrtrem
> 2              106.08         95.39        #91.21        108.41
> The abp function seems the faster one (at least on shell), but...
> > However don't forget that nothing is fully checked nor proven in my
> > implementation! It may still be wrong in some corner cases!
> This is the biggest issue with the apb implementation... correctness
> should be a prerequisite :-) but the promised speed is interesting.
> Regards,
> m

More information about the gmp-devel mailing list