mpn_sqrtrem{1,2} - floating-point sqrts{s,d}
Adrien Prost-Boucle
adrien.prost-boucle at laposte.net
Sat Apr 1 14:32:21 UTC 2017
Hi,
On Sat, 2017-03-25 at 21:34 +0100, Torbjörn Granlund wrote:
> The sqrtss and sqrtsd are SIMD operations, right? That means that if we
> don't initialise all input fields with something, they might contain
> special values which trigger exceptional conditions.
The Intel docs say that instructions SQRTSS and SQRTSD are scalar.
Citing:
[10.4.1.2 SSE Arithmetic Instructions]
The SQRTSS (compute square root of scalar single-precision floating-point values) instruction computes the square
root of the low single-precision floating-point value in the source operand and stores the result in the low
doubleword of the destination operand.
[11.4.1.2 SSE2 Arithmetic Instructions]
The SQRTSD (compute square root of scalar double-precision floating-point values) instruction computes the
square root of the low double-precision floating-point value in the source operand and stores the result in the low
quadword of the destination operand.
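The scalar behaviour quoted above can be checked from C with the SSE intrinsics. This is only a minimal sketch, assuming an x86 compiler with SSE enabled; the helper names sqrt_low_lane and check_scalar_semantics are made up for illustration and are not part of GMP. It shows the architectural result only and says nothing about timing.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_loadu_ps, _mm_sqrt_ss, _mm_storeu_ps */

/* Hypothetical helper (not part of GMP): applies SQRTSS to a 4-float vector.
   Only lane 0 is square-rooted; lanes 1..3 are copied through unchanged. */
static void sqrt_low_lane(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);   /* load all four lanes (unaligned load) */
    __m128 r = _mm_sqrt_ss(v);     /* sqrt of lane 0; lanes 1..3 taken from v */
    _mm_storeu_ps(out, r);
}

/* Usage check: a negative value in an upper lane is carried through as-is.
   A packed SQRTPS on the same input would have produced NaN in that lane. */
static void check_scalar_semantics(void)
{
    const float in[4] = { 16.0f, -1.0f, 2.0f, 3.0f };
    float out[4];

    sqrt_low_lane(in, out);
    assert(out[0] == 4.0f);
    assert(out[1] == -1.0f && out[2] == 2.0f && out[3] == 3.0f);
}
```

Calling check_scalar_semantics() from a test program should pass silently if the instruction really only touches the low lane.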
> When timing the instructions on an Intel Ivy Bridge, I saw 10 times
> worse performance from sqrtss than from sqrtsd (i.e., the opposite
> result from you). I didn't investigate the exact reason, but by
> initialising the entire 128-bit input register, the problem went away
> (and the two instructions both ran at ~9 cycles).
>
> If you can repro your timing results, please indicate platform and send
> me a minimal asm program exhibiting the problem.
I can't say for sure what causes the slowdown you observe.
However, this reminds me of something interesting that happened to me a few months ago.
I wanted to test the performance of some personal code on a big machine at work.
I expected that machine to be several times faster than my laptop.
But the program ran twice as slow! That gave me headaches for days.
A colleague figured it out: that machine ran Debian, and its gcc was a bit older than the one on my laptop.
My code was a tiny routine that used the type uint32_t.
Simply changing that to uint64_t or unsigned long solved the problem.
Maybe this is related?
Maybe newer gcc versions handle uninitialized data, or dependencies, better?
So, what compiler are you using?
I have gcc 6.3.1 and my timings are 100% reproducible.
They are reproducible as well on all my machines, which run up-to-date Arch Linux.
Maybe that FP implementation should be enabled only for some versions of the compiler...
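To make that idea concrete, here is a rough sketch of gating a floating-point square root on the compiler version. Everything in it is an assumption for illustration: the macro USE_FP_SQRT and the function isqrt32 are hypothetical names, not GMP code, and the threshold of gcc 6 is picked only because 6.3.1 is the version mentioned here, not because it was measured. Note that clang reports __GNUC__ as 4, so it would take the integer path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch (not a GMP macro): use the SSE2 path only on gcc >= 6. */
#if defined(__SSE2__) && defined(__GNUC__) && (__GNUC__ >= 6)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_set_sd, _mm_sqrt_sd, _mm_cvttsd_si32 */
#define USE_FP_SQRT 1
#else
#define USE_FP_SQRT 0
#endif

/* Hypothetical helper: floor(sqrt(n)) for a 32-bit unsigned integer. */
static uint32_t isqrt32(uint32_t n)
{
    uint32_t s;
#if USE_FP_SQRT
    /* _mm_set_sd zeroes the upper lane, so the whole register is initialised. */
    __m128d x = _mm_set_sd((double)n);
    s = (uint32_t)_mm_cvttsd_si32(_mm_sqrt_sd(x, x));   /* truncate toward zero */
    /* Defensive correction steps in case the estimate is off by one. */
    while ((uint64_t)s * s > n) s--;
    while ((uint64_t)(s + 1) * (s + 1) <= n) s++;
#else
    /* Integer fallback: classic bit-by-bit square root. */
    uint32_t rem = n, bit = UINT32_C(1) << 30;
    s = 0;
    while (bit > rem) bit >>= 2;
    while (bit != 0) {
        if (rem >= s + bit) { rem -= s + bit; s = (s >> 1) + bit; }
        else s >>= 1;
        bit >>= 2;
    }
#endif
    /* Postcondition: s*s <= n < (s+1)*(s+1). */
    assert((uint64_t)s * s <= n && (uint64_t)(s + 1) * (s + 1) > n);
    return s;
}
```

Both paths should return the same values, so the switch only affects speed, which would still have to be measured per compiler and per machine.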
Adrien