Anomaly in mpn_sqrtrem and mpn_rottrem

Tue Jun 9 17:02:48 UTC 2015

bodrato at mail.dm.unipi.it writes:

  To compare them I wrote a quick and dirty specialization of the rootrem
  algorithm for the k==2 case (use sqr instead of mul, lshift instead of
  mul_1...)

Cool!

We should probably use sqr whenever possible, perhaps it is worth having
a condition for this in rootrem.c, (I believe mpn_pow_1 gets it right,
and uses sqr whenever possible, since I looked into that as an
explanation.)

  $ tune/speed -p 100000000  -s 1-1100000 -f16 mpn_root.2 mpn_sqrt
  mpn_rootrem.2 mpn_sqrtrem
  overhead 0.000000002 secs, precision 100000000 units of 2.86e-10 secs, CPU
  freq 3500.08 MHz
             mpn_root.2      mpn_sqrt mpn_rootrem.2   mpn_sqrtrem
  1         0.000000518   0.000000011   0.000000516  #0.000000009
  16        0.000001103   0.000000244   0.000001117  #0.000000244
  256       0.000009316  #0.000008321   0.000014106   0.000008374
  4096     #0.000633967   0.000659180   0.000894513   0.000654562
  65536    #0.026071234   0.026756061   0.033005667   0.026762939
  1048576   0.732938835   0.734521311   0.962774086  #0.731488979

  With the attached specialised function, rootrem.2 gains a 10% in speed and
  is faster than sqrt when the reminder is not needed and size grows.

  $ tune/speed -p 100000000  -s 1-1100000 -f16 mpn_root.2 mpn_sqrt
  mpn_rootrem.2 mpn_sqrtrem
  overhead 0.000000002 secs, precision 100000000 units of 2.86e-10 secs, CPU
  freq 3500.08 MHz
             mpn_root.2      mpn_sqrt mpn_rootrem.2   mpn_sqrtrem
  1         0.000000376  #0.000000010   0.000000383   0.000000010
  16        0.000000884  #0.000000241   0.000000894   0.000000243
  256      #0.000007743   0.000008262   0.000012138   0.000008282
  4096     #0.000563298   0.000657382   0.000816271   0.000650694
  65536    #0.023296001   0.026694774   0.030225570   0.026717529
  1048576  #0.654117324   0.736489574   0.832002002   0.733319729

Your timing leaps hide some of the actual timing characteristics...

Using
  tune/speed -r -p10000000 -s16-10000000 -f1.438  mpn_sqrt mpn_root.2
I get about 0.97 on average.

And
  tune/speed -r -p10000000 -s16-10000000 -f1.438  mpn_sqrtrem mpn_rootrem.2
gives around 1.25.

(In both cases without your rootrem hack.)

  I added some TRACE (printf...) in the code to show the primitives with a
  super-linear cost used in computation.
  The last functions called by mpn_sqrtrem when computing the root of a 2000
  limb operand are:

  mpn_divrem nn=500/dn=250 -> qn=250,rn=250
  mpn_sqr an=250 ^ 2 -> rn=500
  mpn_divrem nn=1000/dn=500 -> qn=500,rn=500
  mpn_sqr an=500 ^ 2 -> rn=1000

  Similarly, for mpn_rootrem:

  mpn_div_q nn=500/dn=251 -> qn=249
  mpn_sqr an=501 ^ 2 -> rn=1002
  mpn_div_q nn=1000/dn=501 -> qn=499
  mpn_sqr an=1000 ^ 2 -> rn=2000

Interesting indeed.  I wonder why they need different multiply sizes.

  It uses div_q instead of divrem, but squares used by rootrem are doubled
  in size...

  The operation are similar if compared to sqrt (the order changes),
  probably div_q is faster than divrem. mpn_sqrt wins only when the huge
  speed-up given by the specialised code for the first limb is relevant...

I am not sure that div_q vs divrem matters a whole lot for this usage
(i.e., 2n/n sizes).

-- 
Torbjörn
Please encrypt, key id 0xC8601622