mpn_sqrtrem{1,2} - floating-point sqrts{s,d}

Sat Apr 1 14:32:21 UTC 2017

Hi,

On Sat, 2017-03-25 at 21:34 +0100, Torbjörn Granlund wrote:
> The sqrtss and sqrtsd are SIMD operations, right?  That means that if we
> don't initialise all input fields with something, they might contain
> special values which trigger exceptional conditions.

The Intel docs say that the SQRTSS and SQRTSD instructions are scalar.
Citing:

[10.4.1.2 SSE Arithmetic Instructions]
The SQRTSS (compute square root of scalar single-precision floating-point values) instruction computes the square
root of the low single-precision floating-point value in the source operand and stores the result in the low
doubleword of the destination operand.

[11.4.1.2 SSE2 Arithmetic Instructions]
The SQRTSD (compute square root of scalar double-precision floating-point values) instruction computes the
square root of the low double-precision floating-point value in the source operand and stores the result in the low
quadword of the destination operand.
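To illustrate the scalar behaviour described in these excerpts, here is a minimal sketch using the SSE intrinsic _mm_sqrt_ss, assuming an x86 compiler that provides <xmmintrin.h>. The helper name sqrtss_lanes is made up for this example and is not part of GMP.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ps, _mm_sqrt_ss, _mm_storeu_ps */

/* Hypothetical helper: run the scalar single-precision square root on a
   4-lane register and write all four result lanes to out[].
   Only lane 0 is square-rooted; lanes 1..3 are copied from the input. */
static void sqrtss_lanes(float e0, float e1, float e2, float e3, float out[4])
{
    __m128 v = _mm_set_ps(e3, e2, e1, e0);  /* _mm_set_ps takes the highest lane first */
    __m128 r = _mm_sqrt_ss(v);              /* sqrt of lane 0, upper lanes passed through */
    _mm_storeu_ps(out, r);                  /* unaligned store of all four lanes */
}
```

Calling sqrtss_lanes(16.0f, -1.0f, 2.0f, 3.0f, out) leaves the -1.0f in lane 1 untouched, so the value in an upper lane is never fed to the square root.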

> When timing the instructions on an Intel Ivy Bridge, I saw 10 times
> worse performance from sqrtss than from sqrtsd (i.e., the opposite
> result from you).  I didn't investigate the exact reason, but by
> initialising the entire 128-bit input register, the problem went away
> (and the two instructions both ran at ~9 cycles).
> 
> If you can repro your timing results, please indicate platform and send
> me a minimal asm program exhibiting the problem.
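For reference, one way to initialise the entire 128-bit input register from C is sketched below, assuming an x86 compiler with SSE2 and <emmintrin.h>. The function names are made up for this example, this is not the GMP code, and whether it avoids the slowdown reported above has not been verified; the compiler is also free to pick a different instruction sequence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */

/* Hypothetical helper: single-precision square root with a fully
   initialised input register. _mm_set_ss puts x in lane 0 and
   zeroes lanes 1..3, so no stale data remains in the register. */
static float sqrt_f32_clean(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_sqrt_ss(v));
}

/* Hypothetical helper: same idea for double precision. _mm_set_sd puts x
   in lane 0 and zeroes lane 1. _mm_sqrt_sd(a, b) takes the square root of
   the low lane of b and copies the upper lane from a. */
static double sqrt_f64_clean(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}
```

Timing these two helpers in a loop, next to variants that leave the upper lanes uninitialised, would be one way to check whether the difference comes from the register contents or from the compiler.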

I can't say for sure what explains the slowdown you observe.
However, this reminds me of something interesting that happened to me a few months ago.

I wanted to test the performance of some personal code on a big machine at work.
I expected that machine to be several times faster than my laptop.
But the program ran twice as slow! That gave me headaches for days.
A colleague figured it out: the machine was running Debian, and its gcc was a bit older than the one on my laptop.
My code was a tiny routine that used the type uint32_t.
Simply changing that to uint64_t or unsigned long solved the problem.

Maybe this is related?
Maybe newer gcc versions handle uninitialized data, or register dependencies, better?

So, what compiler are you using?
I have gcc 6.3.1 and my timings are 100% reproducible.
They are reproducible as well on all my machines, which run up-to-date Arch Linux.

Maybe that FP implementation should be enabled only for some versions of the compiler...

Adrien
