Anomaly in mpn_sqrtrem and mpn_rootrem

Tue Aug 18 08:51:49 UTC 2015

"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:

  With two 256-chars tables(too much?), it is possible to estimate 9 bits
  (seldom the error is +2, previous code always used a single correction).

512 bytes is in total for rootrem tables is OK, IMHO.

  Before we started refining mpn_rootrem (changeset 16672:c86f4fc0aafe) the
  timing (on shell.gmplib) was:

I miss some rootrem logs from ChangeLog.

  $tune/speed -cs 1-1024 -f3 mpn_root.3 mpn_root.32 mpn_root.332 mpn_root.3332
  overhead 5.84 cycles, precision 100000000 units of 2.86e-10 secs, CPU freq
  3500.08 MHz
             mpn_root.3   mpn_root.32  mpn_root.332 mpn_root.3332
  1             2120.44        424.16         84.71        #84.70
  3             2537.38       2255.13         86.57        #86.53
  9             3556.60       4661.17        687.10        #99.87
  27            4779.44       6667.10       7156.09       #151.71
  81            8768.02      13230.91      31229.99      #6717.58
  243          40720.39     #27991.11      72860.77      81112.17
  729         135971.53    #112360.63     240474.62    1060633.50

  The current repo gives:
             mpn_root.3   mpn_root.32  mpn_root.332 mpn_root.3332
  1             1985.55        290.12        #27.20         27.21
  3             2373.10       2015.57         27.21        #27.21
  9             3359.58       4401.62        411.39        #27.21
  27            4626.43       6317.53       6857.23        #27.21
  81            8583.51      12816.22      30837.67      #5783.87
  243          39055.19     #27236.95      71936.69      78244.04
  729         134456.72    #111492.59     235518.65    1054410.50

  The code I'm experimenting with, reduces timings for small results:
             mpn_root.3   mpn_root.32  mpn_root.332 mpn_root.3332
  1              634.94        193.03        #28.22         28.25
  3             1289.25        200.82        #28.17         28.18
  9             1865.79       1341.31        384.23        #28.24
  27            3411.88       3257.38        816.07        #28.24
  81            7518.93       9572.27      18564.51      #1991.75
  243          37723.17      24465.42      61376.28     #20745.56
  729         133654.08    #108648.30     226299.89     768142.04

  I'm still testing it, but I think I'll push it soon...

Wow, awesome speedups!

Reasonable starting values has long been on my TODO list.  I hadn't
realised that one could get away with tiny tables for all k values.  I
had a combination of tables and single-limb iterations in mind.

  > Now, for small sizes, computing the 4th root with mpn_root is slower
  > than calling mpn_sqrt twice.

  Even with the new code, this is true on a wide range (up to 1000 limbs).
  Should we write special code for k=4,8,16 that repeatedly calls mpn_sqrt?

That's worth considering, sure.  Asking the user to do that (and doing
it ourselves from mpz etc) does not make sense, I think.

How much speed difference is there now, for k = 4 vs sqrt(sqrt())?  Is
the difference small enough that we could fix it by running the first
few iterations using plain limb arithmetic?

-- 
Torbjörn
Please encrypt, key id 0xC8601622