mpn_sqrtrem{1,2}

Sat Mar 25 20:34:47 UTC 2017

Adrien Prost-Boucle <adrien.prost-boucle at laposte.net> writes:

  As far as I know, these instructions are affected by the rounding mode.
  And no instruction specifies the rounding mode.

  So, I have to assume worst-case and consider the user programs may
  have set a rounding mode that I don't expect.
  That's why I propose - as a first proposal - a function that always
  checks the output and applies a correction, +1 or -1.

I still think that could be avoided.

Suppose we pre-multiply the input by, say, 1-2^{-52}, then perform the
square root, then truncate to mp_limb_t.  That ought to preclude any
results which are too large.

  About why float versus double for 32-bit inputs:
  I did some benchmarking, 2 versions:
  - use float and check for correction +1 or -1
  - use double and don't do correction
  The float+correction is always faster.
  Similarly for 64-bit inputs, I tested double+correction and long double
  with no correction, and double+correction is always faster.

I am unconvinced of these results' correctness.

The sqrtss and sqrtsd are SIMD operations, right?  That means that if we
don't initialise all input fields with something, they might contain
special values which trigger exceptional conditions.

When timing the instructions on an Intel Ivy Bridge, I saw 10 times
worse performance from sqrtss than from sqrtsd (i.e., the opposite
result from you).  I didn't investigate the exact reason, but by
initialising the entire 128-bit input register, the problem went away
(and the two instructions both ran at ~9 cycles).
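
One way to get a fully initialised register from C is sketched below,
assuming an x86 compiler that provides the SSE intrinsics in
<xmmintrin.h>.  The function name sqrtf_full_init is hypothetical, and
whether a given compiler emits the same instruction sequence as a
hand-written asm loop is not guaranteed.

```c
#include <math.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: single-precision square root with all four
   lanes of the 128-bit input register initialised. */
static float
sqrtf_full_init (float x)
{
#if defined(__SSE__)
  /* _mm_set_ss puts x in the low lane and zeroes the upper three,
     so no stale data is left in the register. */
  __m128 v = _mm_set_ss (x);

  /* _mm_sqrt_ss takes the square root of the low lane only and
     copies the upper lanes through unchanged. */
  return _mm_cvtss_f32 (_mm_sqrt_ss (v));
#else
  /* Fallback for targets without SSE: plain library call. */
  return sqrtf (x);
#endif
}
```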

If you can repro your timing results, please indicate platform and send
me a minimal asm program exhibiting the problem.

  Now, why float with 23-bit mantissa is enough for 32-bit inputs:
  a change on the 16 least significant bits changes the root by only +/-1.
  The mantissa may be too short by 9 bits, that'll always be covered by
  the +/-1 correction.

Sure, I understand that the result will be close even for this, but
sqrtsd will in practice always give the correct result (and for all
rounding modes assuming the input is pre-multiplied as per my suggestion).

-- 
Torbjörn
Please encrypt, key id 0xC8601622