<DKIM> Re: mpn_sqrtrem{1,2}
Adrien Prost-Boucle
adrien.prost-boucle at laposte.net
Sun Jan 29 09:37:51 UTC 2017
Hi,
I forgot to highlight that the floating-point version for 2x64
only works if the long double type is 128-bit (quad precision).
Is there a test macro in GMP for that?
Similarly, for the single 64-bit version, if long double is actually just a
64-bit double, then an error check after sqrtl() is needed.
There are probably other subtleties about floating-point arithmetic
that I am not aware of...
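
One way to test this at compile time might be to look at LDBL_MANT_DIG
from <float.h>; this is just a sketch of the idea, not an existing GMP macro:

#include <float.h>

/* IEEE binary128 has a 113-bit significand, x87 extended precision has 64,
   a plain double has 53 (and double-double gives 106). */
#if LDBL_MANT_DIG >= 113
  /* long double is (at least) quad precision: the 2x64 FP path is usable */
#elif LDBL_MANT_DIG == 53
  /* long double is just double: sqrtl() needs the extra error check */
#else
  /* extended or double-double: would need more thought */
#endif
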
So first I'd like to know,
what do GMP developers think about using FP there?
Adrien
On Sat, 2017-01-28 at 17:49 +0100, Adrien Prost-Boucle wrote:
> Hi Marco,
>
> Thank you for the tests,
> and for the example of benchmarking with GMP.
>
> We must be adamant about correctness, of course.
> Recently I've been contacted separately by Bradley Lucier,
> who discovered an error of -1 for a particular input of the 2x64 version...
> So yes, correctness is an issue.
>
> Bradley Lucier also proposed to try a completely different approach:
> simply use floating-point functions sqrtf(), sqrt(), sqrtl().
>
> Indeed, recent and not-so-recent CPUs have floating-point sqrt instructions.
> Also, if I understand correctly, the IEEE 754 standard requires sqrt to be
> correctly rounded, which in particular means the result is nondecreasing
> as the input value increases.
> This makes it possible to ensure correctness with a small final adjustment.
>
> Basically, the algorithm is:
> - convert the input value to floating-point (float, double or long double)
> - call sqrtf(), sqrt() or sqrtl()
> - convert back to integer
> - apply correction -1 or +1 if necessary
>
> At the end of this message is my code for 32b, 64b and 2x64b.
> By default I check only for the -1 correction, because it seems to be the only
> one necessary, thanks to the nondecreasing property of the sqrt instructions.
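>
> As a quick sanity check of the -1 path (assuming the sqrt64_float_double()
> given at the end of this message): UINT64_MAX rounds up to 2^64 when converted
> to double, sqrt() then returns 2^32, and the correction must bring the root
> back down to 4294967295. It relies on the usual two's-complement behaviour of
> the casts and shifts, as discussed further down.
>
> #include <assert.h>
> #include <stdint.h>
>
> uint64_t sqrt64_float_double(uint64_t val);  /* given at the end of this message */
>
> int main(void) {
>     /* floor(sqrt(2^64 - 1)) = 2^32 - 1: the FP root overshoots to 2^32,
>        root*root wraps to 0, the remainder becomes negative and corr = -1 */
>     assert(sqrt64_float_double(UINT64_MAX) == UINT64_C(4294967295));
>     return 0;
> }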
>
> Here is a quick benchmark on my Core2 Duo mobile CPU:
> (speedup is around 2x compared to mpn)
>
> ./sqrt -noslow bench32 10000000
> null ....... 0.0135976 s
> mpn ........ 0.134436 s
> inv ........ 0.129447 s
> fp-f ....... 0.0614301 s
> fp-d ....... 0.0955542 s
>
> ./sqrt -noslow bench64 10000000
> null ....... 0.0113118 s
> mpn ........ 0.238126 s
> inv ........ 0.225985 s
> fp-d ....... 0.115913 s
> fp-ld ...... 0.117514 s
>
> ./sqrt -noslow bench64x2 10000000
> null ....... 0.0123146 s
> mpn ........ 0.709794 s
> inv ........ 0.606813 s
> fp-ld ...... 0.223781 s
>
> Here is a quick benchmark on a higher-end i7 CPU:
> (speedup is at least 3x compared to mpn)
>
> ./sqrt -noslow bench32 10000000
> null ....... 0.00276399 s
> mpn ........ 0.0957994 s
> inv ........ 0.0418732 s
> fp-f ....... 0.0194476 s
> fp-d ....... 0.0376165 s
>
> ./sqrt -noslow bench64 10000000
> null ....... 0.00264621 s
> mpn ........ 0.137777 s
> inv ........ 0.0551834 s
> fp-d ....... 0.0389557 s
> fp-ld ...... 0.0494387 s
>
> ./sqrt -noslow bench64x2 10000000
> null ....... 0.00257349 s
> mpn ........ 0.181686 s
> inv ........ 0.191645 s
> fp-ld ...... 0.0610073 s
>
> When compiled with gcc with some optimization, the calls to sqrt() are not even
> present in the compiled program: they are "inlined" as the appropriate CPU sqrt
> instructions. So it's very efficient.
>
> However, floating-point is probably not a good solution for embedded systems,
> which may not have a floating-point instruction set
> (ARM machines without the NEON extension, in particular).
> So the current fully integer-based code should be kept.
>
> I have updated my standalone test program with these algorithms:
> http://94.23.21.190/publicshare/sqrt.tar.gz
> make
> make bench
>
> About my "inv" algorithm:
> I fortuitously discovered that, on some systems and with some compilers
> (Debian with gcc 5.4), the generated code is much slower than "mpn".
> The culprit was the use of int32_t types; using int_fast32_t instead solved the issue.
>
> For a quick glance at my code, here it is.
> Bradley Lucier noted that right-shifting a negative value is not strictly
> well-defined in C (it is implementation-defined).
> There has already been some discussion on the GMP mailing lists about that,
> so I'll just let you guys decide.
> I compile with -fno-math-errno to tell gcc that sqrt of a negative number
> just returns a NaN without setting errno.
> I've not investigated the impact on speed.
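>
> For reference, a compile line in this spirit (the other flags are just an
> example, not exactly what my Makefile does):
>
>     gcc -O2 -fno-math-errno -o sqrt sqrt.c -lm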
>
>
>
> // Uncomment this to enable check for error correction +1
> //#define FLOAT_CORR_P1
>
> uint32_t sqrt32_float(uint32_t val) {
>     uint32_t root = (uint32_t)sqrtf((float) val);
>     // Remainder val - root^2; a negative value means the root is too high
>     int32_t rem = val - root * root;
>     int32_t corr = rem >> 31;
> #ifdef FLOAT_CORR_P1
>     // If 2*root - rem is negative, the root is too low
>     rem = 2*root - rem;
>     corr += ((uint32_t)rem) >> 31;
> #endif
>     //printf("%"PRIi32" ", corr);
>     root += corr;
>     return root;
> }
> uint32_t sqrt32_float_double(uint32_t val) {
>     // A 32-bit value converts exactly to double, so no correction is needed
>     return (uint32_t)sqrt((double) val);
> }
>
> uint64_t sqrt64_float_double(uint64_t val) {
>     uint64_t root = (uint64_t)sqrt((double) val);
>     // Remainder val - root^2; a negative value means the root is too high
>     int64_t rem = val - root * root;
>     int64_t corr = rem >> 63;
> #ifdef FLOAT_CORR_P1
>     // If 2*root - rem is negative, the root is too low
>     rem = 2*root - rem;
>     corr += ((uint64_t)rem) >> 63;
> #endif
>     //printf("%"PRIi64" ", corr);
>     root += corr;
>     return root;
> }
> uint64_t sqrt64_float_longdouble(uint64_t val) {
>     // No correction here: assumes long double is wider than double (see above)
>     return (uint64_t)sqrtl((long double) val);
> }
>
> #define CST2POW64 (65536. * 65536. * 65536. * 65536.)  // 2^64 as a floating-point constant
>
> uint64_t sqrt64x2_float(uint64_t valh, uint64_t vall) {
>     // Convert the 128-bit input (valh:vall) to long double and take its square root
>     long double ld = valh;
>     ld = (ld * CST2POW64) + vall;
>     uint64_t root = sqrtl(ld);
>
>     uint64_t remh, reml;
>     int64_t corr;
>
>     // Calculate r^2
>     umul64x2(&remh, &reml, root, root);
>     // Calculate val-r^2
>     usub64x2(&remh, &reml, valh, vall, remh, reml);
>     // Get the correction -1 if the root is too high
>     corr = ((int64_t)remh) >> 63;
>
> #ifdef FLOAT_CORR_P1
>     // Subtract the remainder from 2*r to check if the root is too low
>     usub64x2(&remh, &reml, root >> 63, root << 1, remh, reml);
>     // Get the full correction
>     corr += remh >> 63;
> #endif
>
>     // Apply the error correction
>     //printf("%"PRIi64" ", corr);
>     root += corr;
>
>     return root;
> }
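>
> (umul64x2() and usub64x2() are small 128-bit helpers from my test program.
> Roughly, they do the following; this sketch uses gcc's unsigned __int128 and
> is not the exact code from the tarball.)
>
> #include <stdint.h>
>
> typedef unsigned __int128 u128;
>
> // 64x64 -> 128-bit multiplication, result split into high/low 64-bit words
> static void umul64x2(uint64_t *ph, uint64_t *pl, uint64_t a, uint64_t b) {
>     u128 p = (u128)a * b;
>     *ph = (uint64_t)(p >> 64);
>     *pl = (uint64_t)p;
> }
>
> // 128-bit subtraction (ah:al) - (bh:bl), result split into high/low words
> static void usub64x2(uint64_t *ph, uint64_t *pl,
>                      uint64_t ah, uint64_t al,
>                      uint64_t bh, uint64_t bl) {
>     u128 r = (((u128)ah << 64) | al) - (((u128)bh << 64) | bl);
>     *ph = (uint64_t)(r >> 64);
>     *pl = (uint64_t)r;
> }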
>
>
>
> Adrien
>
>
> On Sat, 2017-01-28 at 14:12 +0100, Marco Bodrato wrote:
> > Ciao,
> >
> > On Fri, 23 December 2016 5:22 pm, Adrien Prost-Boucle wrote:
> > > I now have the same API for my implementation and for the code I've taken
> > > from GMP.
> >
> > As a contribution to this discussion, I did the same thing, but the other
> > direction.
> >
> > The attached code contains a copy of mpn_sqrtrem{1,2} from our repository,
> > a copy of the sqrtrem2 code that Paul Zimmermann posted to this list (July
> > 2016), and an adaptation of Adrien Prost-Boucle's sqrtrem2 to the API and
> > syntax currently used by this internal function of GMP.
> >
> > The attached file is meant as a replacement for the tune/speed-ext.c file
> > in a GMP source tree. After replacement:
> >
> > ./configure ABI=64 && make && (cd tune;make speed-ext)
> >
> > And the tune/speed-ext program will be ready to compare the speed of the
> > three implementations.
> >
> > Two limitations: it works only with ABI=64, and some measurements seem
> > sensitive to the "precision" requested for the measures... this is unexpected
> > to me.
> >
> > > [bodrato at shell ~/gmp-repo]$ tune/speed-ext -s2 -cp1000000000 mpn_sqrtrem2 pz_sqrtrem2 apb_sqrtrem2 mpn_sqrtrem
> >
> > overhead 5.84 cycles, precision 1000000000 units of 2.86e-10 secs, CPU freq 3500.07 MHz
> >         mpn_sqrtrem2  pz_sqrtrem2  apb_sqrtrem2  mpn_sqrtrem
> > 2             106.08        95.39        #91.21       108.41
> >
> > The apb function seems to be the fastest (at least on shell), but...
> >
> > > However don't forget that nothing is fully checked nor proven in my
> > > implementation! It may still be wrong in some corner cases!
> >
> > This is the biggest issue with the apb implementation... correctness
> > should be a prerequisite :-) but the promised speed is interesting.
> >
> > Regards,
> > m
> >
>
> _______________________________________________
> gmp-devel mailing list
> gmp-devel at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-devel