<DKIM> Re: mpn_sqrtrem{1,2}
Adrien Prost-Boucle
adrien.prost-boucle at laposte.net
Sun Jan 29 09:37:51 UTC 2017
Hi,
I forgot to highlight that the floating-point version for 2x64
only works if the long double type is 128-bit (quad precision).
Is there a test macro in GMP for that?
Similarly, for the single 64-bit version, if long double is actually just a
64-bit double, then an error check after sqrtl() is needed.
There are probably other subtleties about floating-point arithmetic
that I am not aware of...
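
One way to test this at compile time might be to look at LDBL_MANT_DIG
from <float.h>; this is just a sketch of the idea, not an existing GMP macro:

#include <float.h>

/* IEEE binary128 has a 113-bit significand, x87 extended precision has 64,
   a plain double has 53 (and double-double gives 106). */
#if LDBL_MANT_DIG >= 113
  /* long double is (at least) quad precision: the 2x64 FP path is usable */
#elif LDBL_MANT_DIG == 53
  /* long double is just double: sqrtl() needs the extra error check */
#else
  /* extended or double-double: would need more thought */
#endif
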
So first I'd like to know,
what do GMP developers think about using FP there?
Adrien
On Sat, 2017-01-28 at 17:49 +0100, Adrien Prost-Boucle wrote:
> Hi Marco,
>
> Thank you for the tests,
> and for the example of benchmarking with GMP.
>
> We must be adamant about correctness, of course.
> Recently I've been contacted separately by Bradley Lucier,
> who discovered an error of -1 for a particular input of the 2x64 version...
> So yes, correctness is an issue.
>
> Bradley Lucier also proposed to try a completely different approach:
> simply use floating-point functions sqrtf(), sqrt(), sqrtl().
>
> Indeed, recent and not-so-recent CPUs have floating-point sqrt instructions.
> Also, if I understand correctly, the IEEE 754 standard requires sqrt to be
> correctly rounded, which in particular means the result is nondecreasing
> as the input value increases.
> This makes it possible to ensure correctness with a small final adjustment.
>
> Basically, the algorithm is:
> - convert the input value to floating-point (float, double or long double)
> - call sqrtf(), sqrt() or sqrtl()
> - convert back to integer
> - apply correction -1 or +1 if necessary
>
> At the end of this message is my code for 32b, 64b and 2x64b.
> By default I check only for the -1 correction, because it seems to be the only
> one necessary, thanks to the nondecreasing property of the sqrt instructions.
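>
> As a quick sanity check of the -1 path (assuming the sqrt64_float_double()
> given at the end of this message): UINT64_MAX rounds up to 2^64 when converted
> to double, sqrt() then returns 2^32, and the correction must bring the root
> back down to 4294967295. It relies on the usual two's-complement behaviour of
> the casts and shifts, as discussed further down.
>
> #include <assert.h>
> #include <stdint.h>
>
> uint64_t sqrt64_float_double(uint64_t val);  /* given at the end of this message */
>
> int main(void) {
>     /* floor(sqrt(2^64 - 1)) = 2^32 - 1: the FP root overshoots to 2^32,
>        root*root wraps to 0, the remainder becomes negative and corr = -1 */
>     assert(sqrt64_float_double(UINT64_MAX) == UINT64_C(4294967295));
>     return 0;
> }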
>
> Here is a quick benchmark on my Core2 Duo mobile CPU:
> (speedup is around 2x compared to mpn)
>
> ./sqrt -noslow bench32 10000000
> null ....... 0.0135976 s
> mpn ........ 0.134436 s
> inv ........ 0.129447 s
> fp-f ....... 0.0614301 s
> fp-d ....... 0.0955542 s
>
> ./sqrt -noslow bench64 10000000
> null ....... 0.0113118 s
> mpn ........ 0.238126 s
> inv ........ 0.225985 s
> fp-d ....... 0.115913 s
> fp-ld ...... 0.117514 s
>
> ./sqrt -noslow bench64x2 10000000
> null ....... 0.0123146 s
> mpn ........ 0.709794 s
> inv ........ 0.606813 s
> fp-ld ...... 0.223781 s
>
> Here is a quick benchmark on a higher-end i7 CPU:
> (speedup is at least 3x compared to mpn)
>
> ./sqrt -noslow bench32 10000000
> null ....... 0.00276399 s
> mpn ........ 0.0957994 s
> inv ........ 0.0418732 s
> fp-f ....... 0.0194476 s
> fp-d ....... 0.0376165 s
>
> ./sqrt -noslow bench64 10000000
> null ....... 0.00264621 s
> mpn ........ 0.137777 s
> inv ........ 0.0551834 s
> fp-d ....... 0.0389557 s
> fp-ld ...... 0.0494387 s
>
> ./sqrt -noslow bench64x2 10000000
> null ....... 0.00257349 s
> mpn ........ 0.181686 s
> inv ........ 0.191645 s
> fp-ld ...... 0.0610073 s
>
> When compiled with gcc with some optimization, the calls to sqrt() are not even
> present in the compiled program: they are "inlined" as the appropriate CPU sqrt
> instructions. So it's very efficient.
>
> However, floating-point is probably not a good solution for embedded systems,
> which may not have a floating-point instruction set
> (ARM machines without the NEON extension, in particular).
> So the current fully integer-based code should be kept.
>
> I have updated my standalone test program with these algorithms:
> http://94.23.21.190/publicshare/sqrt.tar.gz
> make
> make bench
>
> About my "inv" algorithm:
> I fortuitously discovered that, on some systems and with some compilers
> (Debian with gcc 5.4), the generated code is much slower than "mpn".
> The culprit was the use of int32_t types; using int_fast32_t instead solved the issue.
>
> For a quick glance at my code, here it is.
> Bradley Lucier noted that right-shifting a negative value is not strictly
> well-defined in C (it is implementation-defined).
> There has already been some discussion on the GMP mailing lists about that,
> so I'll just let you guys decide.
> I compile with -fno-math-errno to tell gcc that sqrt of a negative number
> just returns a NaN without setting errno.
> I've not investigated the impact on speed.
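>
> For reference, a compile line in this spirit (the other flags are just an
> example, not exactly what my Makefile does):
>
>     gcc -O2 -fno-math-errno -o sqrt sqrt.c -lm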
>
>
>
> // Uncomment this to enable check for error correction +1
> //#define FLOAT_CORR_P1
>
> uint32_t sqrt32_float(uint32_t val) {
>     uint32_t root = (uint32_t)sqrtf((float) val);
>     // Remainder val - root^2; a negative value means the root is too high
>     int32_t rem = val - root * root;
>     int32_t corr = rem >> 31;
> #ifdef FLOAT_CORR_P1
>     // If 2*root - rem is negative, the root is too low
>     rem = 2*root - rem;
>     corr += ((uint32_t)rem) >> 31;
> #endif
>     //printf("%"PRIi32" ", corr);
>     root += corr;
>     return root;
> }
> uint32_t sqrt32_float_double(uint32_t val) {
>     // A 32-bit value converts exactly to double, so no correction is needed
>     return (uint32_t)sqrt((double) val);
> }
>
> uint64_t sqrt64_float_double(uint64_t val) {
>     uint64_t root = (uint64_t)sqrt((double) val);
>     // Remainder val - root^2; a negative value means the root is too high
>     int64_t rem = val - root * root;
>     int64_t corr = rem >> 63;
> #ifdef FLOAT_CORR_P1
>     // If 2*root - rem is negative, the root is too low
>     rem = 2*root - rem;
>     corr += ((uint64_t)rem) >> 63;
> #endif
>     //printf("%"PRIi64" ", corr);
>     root += corr;
>     return root;
> }
> uint64_t sqrt64_float_longdouble(uint64_t val) {
>     // No correction here: assumes long double is wider than double (see above)
>     return (uint64_t)sqrtl((long double) val);
> }
>
> #define CST2POW64 (65536. * 65536. * 65536. * 65536.)  // 2^64 as a floating-point constant
>
> uint64_t sqrt64x2_float(uint64_t valh, uint64_t vall) {
>     // Convert the 128-bit input (valh:vall) to long double and take its square root
>     long double ld = valh;
>     ld = (ld * CST2POW64) + vall;
>     uint64_t root = sqrtl(ld);
>
>     uint64_t remh, reml;
>     int64_t corr;
>
>     // Calculate r^2
>     umul64x2(&remh, &reml, root, root);
>     // Calculate val-r^2
>     usub64x2(&remh, &reml, valh, vall, remh, reml);
>     // Get the correction -1 if the root is too high
>     corr = ((int64_t)remh) >> 63;
>
> #ifdef FLOAT_CORR_P1
>     // Subtract the remainder from 2*r to check if the root is too low
>     usub64x2(&remh, &reml, root >> 63, root << 1, remh, reml);
>     // Get the full correction
>     corr += remh >> 63;
> #endif
>
>     // Apply the error correction
>     //printf("%"PRIi64" ", corr);
>     root += corr;
>
>     return root;
> }
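>
> (umul64x2() and usub64x2() are small 128-bit helpers from my test program.
> Roughly, they do the following; this sketch uses gcc's unsigned __int128 and
> is not the exact code from the tarball.)
>
> #include <stdint.h>
>
> typedef unsigned __int128 u128;
>
> // 64x64 -> 128-bit multiplication, result split into high/low 64-bit words
> static void umul64x2(uint64_t *ph, uint64_t *pl, uint64_t a, uint64_t b) {
>     u128 p = (u128)a * b;
>     *ph = (uint64_t)(p >> 64);
>     *pl = (uint64_t)p;
> }
>
> // 128-bit subtraction (ah:al) - (bh:bl), result split into high/low words
> static void usub64x2(uint64_t *ph, uint64_t *pl,
>                      uint64_t ah, uint64_t al,
>                      uint64_t bh, uint64_t bl) {
>     u128 r = (((u128)ah << 64) | al) - (((u128)bh << 64) | bl);
>     *ph = (uint64_t)(r >> 64);
>     *pl = (uint64_t)r;
> }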
>
>
>
> Adrien
>
>
> On Sat, 2017-01-28 at 14:12 +0100, Marco Bodrato wrote:
> > Ciao,
> >
> > On Fri, 23 December 2016 5:22 pm, Adrien Prost-Boucle wrote:
> > > I now have the same API for my implementation and for the code I've taken
> > > from GMP.
> >
> > As a contribution to this discussion, I did the same thing, but the other
> > direction.
> >
> > The attached code contains a copy of mpn_sqrtrem{1,2} from our repository,
> > a copy of the sqrtrem2 code that Paul Zimmermann posted to this list (July
> > 2016), and an adaptation of Adrien Prost-Boucle's sqrtrem2 to the API and
> > syntax currently used by this internal function of GMP.
> >
> > The attached file is meant as a replacement for the tune/speed-ext.c file
> > in a GMP source tree. After replacement:
> >
> > ./configure ABI=64 && make && (cd tune;make speed-ext)
> >
> > And the tune/speed-ext program will be ready to compare the speed of the
> > three implementations.
> >
> > Two limitations: it works only with ABI=64, and some measurements seem
> > sensitive to the "precision" requested for the measures... this is unexpected
> > to me.
> >
> > > [bodrato at shell ~/gmp-repo]$ tune/speed-ext -s2 -cp1000000000 mpn_sqrtrem2 pz_sqrtrem2 apb_sqrtrem2 mpn_sqrtrem
> >
> > overhead 5.84 cycles, precision 1000000000 units of 2.86e-10 secs, CPU freq 3500.07 MHz
> >         mpn_sqrtrem2  pz_sqrtrem2  apb_sqrtrem2  mpn_sqrtrem
> > 2             106.08        95.39        #91.21       108.41
> >
> > The apb function seems to be the fastest (at least on shell), but...
> >
> > > However don't forget that nothing is fully checked nor proven in my
> > > implementation! It may still be wrong in some corner cases!
> >
> > This is the biggest issue with the apb implementation... correctness
> > should be a prerequisite :-) but the promised speed is interesting.
> >
> > Regards,
> > m
> >
>
> _______________________________________________
> gmp-devel mailing list
> gmp-devel at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-devel