udiv_qr_3by2 vs divappr

Niels Möller nisse at lysator.liu.se
Tue Sep 4 20:16:39 UTC 2018


nisse at lysator.liu.se (Niels Möller) writes:

> The new version gets the stricter bounds right, hopefully, and it also
> replaces [p_0 > 0] by 1, simplifying the algorithm.

If we can get away with this (both tests and analysis seem to support
it), we get the updated patch appended below. I've tried benchmarking
on shell, using

$ ./speed-orig -c -s 3-50 -p 1000000 mpn_sbpi1_div_qr > out-orig.txt
$ ./speed-divappr -c -s 3-50 -p 1000000 mpn_sbpi1_div_qr > out-divappr.txt
$ paste out-orig.txt out-divappr.txt | tail +3 | awk '{print $1, $2, $4, $2 / $4, $2 - $4}'
3 160.34 128.47 1.24807 31.87
4 222.39 197.26 1.1274 25.13
5 288.00 262.63 1.0966 25.37
6 362.30 338.90 1.06905 23.4
7 446.91 403.06 1.10879 43.85
8 521.45 481.38 1.08324 40.07
9 610.65 541.38 1.12795 69.27
10 713.42 663.70 1.07491 49.72
11 815.16 734.66 1.10957 80.5
12 902.98 832.67 1.08444 70.31
13 1044.39 937.28 1.11428 107.11
14 1127.26 1052.10 1.07144 75.16
15 1274.56 1205.91 1.05693 68.65
16 1378.21 1305.14 1.05599 73.07
17 1523.96 1451.64 1.04982 72.32
18 1668.39 1581.88 1.05469 86.51
19 1814.28 1709.13 1.06152 105.15
20 1956.08 1863.17 1.04987 92.91
21 2142.21 2007.66 1.06702 134.55
22 2276.47 2148.40 1.05961 128.07
23 2484.44 2316.32 1.07258 168.12
24 2595.90 2493.73 1.04097 102.17
25 2819.39 2688.60 1.04865 130.79
26 2996.90 2844.15 1.05371 152.75
27 3171.11 3039.36 1.04335 131.75
28 3352.41 3217.94 1.04179 134.47
29 3565.41 3426.87 1.04043 138.54
30 3784.60 3624.18 1.04426 160.42
31 3995.90 3844.40 1.03941 151.5
32 4252.88 4061.82 1.04704 191.06
33 4567.11 4312.97 1.05892 254.14
34 4678.86 4500.24 1.03969 178.62
35 4927.72 4752.67 1.03683 175.05
36 5141.81 4968.25 1.03493 173.56
37 5406.31 5243.24 1.0311 163.07
38 5673.23 5489.52 1.03347 183.71
39 5979.83 5745.94 1.04071 233.89
40 6183.17 5973.88 1.03503 209.29
41 6465.40 6274.09 1.03049 191.31
42 6856.47 6509.70 1.05327 346.77
43 7162.40 6812.58 1.05135 349.82
44 7309.28 7113.83 1.02747 195.45
45 7634.08 7411.09 1.03009 222.99
46 8085.24 7697.98 1.05031 387.26
47 8358.55 8010.33 1.04347 348.22
48 8631.03 8318.25 1.0376 312.78
49 8873.02 8639.46 1.02703 233.56
50 9221.23 8956.78 1.02953 264.45

This indicates a speedup of 10% or more up to about a dozen limbs, then
slowly diminishing as the quadratic work dominates more and more. The
absolute saving (last column) keeps growing with the size, though, so we
save a fair number of cycles across the whole range.

I had hoped to see a larger speedup for an unnormalized variant. Now I
doubt there's much to gain, since

  (i) the upfront shift seems to be very efficient, at least on x86_64,
      where it uses simd instructions, and

 (ii) mpn_tdiv_qr doesn't allow any clobbering of the np area, forcing a
      copy even when we don't need any shift (the docs say that np == rp
      is allowed, but don't say anything about clobbering the rest of the
      n area in that case); see the sketch after this list.
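
To make (ii) concrete, here is a minimal sketch (not part of the patch;
the helper name is made up) of the direct, in-place call that mpn_tdiv_qr
cannot make on the caller's n area. It assumes an already normalized
divisor and uses the invert_pi1 macro from gmp-impl.h to compute dinv:

#include "gmp-impl.h"
#include "longlong.h"

/* Hypothetical helper: divide {np,nn} by the normalized divisor {dp,dn}
   in place.  Writes nn-dn quotient limbs to qp, returns the high quotient
   limb, and leaves the remainder in np[0..dn-1], i.e. the whole n area is
   clobbered -- which is exactly what mpn_tdiv_qr may not do to np.  */
mp_limb_t
divide_normalized_in_place (mp_ptr qp, mp_ptr np, mp_size_t nn,
                            mp_srcptr dp, mp_size_t dn)
{
  gmp_pi1_t dinv;

  ASSERT (dn > 2);
  ASSERT (nn >= dn);
  ASSERT ((dp[dn - 1] & GMP_NUMB_HIGHBIT) != 0);

  invert_pi1 (dinv, dp[dn - 1], dp[dn - 2]);
  return mpn_sbpi1_div_qr (qp, np, nn, dp, dn, dinv.inv32);
}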

Regards,
/Niels

diff -r 1ad8cc22b714 mpn/generic/sbpi1_div_qr.c
--- a/mpn/generic/sbpi1_div_qr.c	Tue Jul 03 11:16:06 2018 +0200
+++ b/mpn/generic/sbpi1_div_qr.c	Tue Sep 04 21:54:30 2018 +0200
@@ -39,6 +39,92 @@
 #include "gmp-impl.h"
 #include "longlong.h"
 
+
+#if 1
+static inline mp_limb_t
+divappr_2 (mp_limb_t n2, mp_limb_t n1,
+	   mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
+{
+  mp_limb_t q1, q0, r1, t1, t0, mask;
+
+  umul_ppmm (q1, q0, n2, dinv);
+  add_ssaaaa (q1, q0, q1, q0, n2, n1);
+
+  q1++;
+  umul_ppmm (t1, t0, q1, d0);
+  r1 = n1 - d1 * q1 - t1 - 1;
+
+  mask = - (mp_limb_t) (r1 >= q0);
+  q1 += mask;
+  r1 += mask & (d1 + 1);
+  q1 += (r1 >= d1 - 1);
+
+  return q1;
+}
+
+mp_limb_t
+mpn_sbpi1_div_qr (mp_ptr qp,
+		  mp_ptr np, mp_size_t nn,
+		  mp_srcptr dp, mp_size_t dn,
+		  mp_limb_t dinv)
+{
+  mp_limb_t qh;
+  mp_size_t i;
+  mp_limb_t d1, d0, m1, m0;
+
+  ASSERT (dn > 2);
+  ASSERT (nn >= dn);
+  ASSERT ((dp[dn-1] & GMP_NUMB_HIGHBIT) != 0);
+
+  np += nn;
+
+  qh = mpn_cmp (np - dn, dp, dn) >= 0;
+  if (qh != 0)
+    mpn_sub_n (np - dn, np - dn, dp, dn);
+
+  qp += nn - dn;
+
+  d1 = dp[dn - 1];
+  d0 = dp[dn - 2];
+
+  /* {m1, m0} = {d1, d0} - d1 */
+  m1 = d1 - (d0 < d1);
+  m0 = d0 - d1;
+
+  for (i = nn - dn; i > 0; i--)
+    {
+      mp_limb_t n1, n0;
+      mp_limb_t cy;
+      mp_limb_t q;
+      np--;
+      n1 = np[0];
+      n0 = np[-1];
+
+      if (UNLIKELY (n1 >= m1) && (n1 > m1 || n0 >= m0))
+	q = GMP_NUMB_MASK;
+      else
+	q = divappr_2 (n1, n0, d1, d0, dinv);
+
+      cy = mpn_submul_1 (np - dn, dp, dn, q);
+      ASSERT (cy >= n1);
+
+      if (UNLIKELY (cy != n1))
+	{
+	  ASSERT (cy - 1 == n1);
+	  ASSERT_CARRY(mpn_add_n (np - dn, np - dn, dp, dn));
+	  q--;
+	}
+
+      *--qp = q;
+    }
+
+  return qh;
+}
+
+#else
+
 mp_limb_t
 mpn_sbpi1_div_qr (mp_ptr qp,
 		  mp_ptr np, mp_size_t nn,
@@ -107,3 +193,4 @@
 
   return qh;
 }
+#endif
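
Not part of the patch, but for a quick sanity check one can compare the
patched function against mpn_tdiv_qr on random operands. The little test
below is only a sketch: it assumes it is built inside the GMP source tree
(it uses the internal headers and invert_pi1), and the sizes and iteration
count are arbitrary.

#include <assert.h>
#include <stdio.h>

#include "gmp-impl.h"
#include "longlong.h"

#define DN 5   /* divisor limbs, must be > 2 */
#define NN 12  /* numerator limbs, must be >= DN */

int
main (void)
{
  mp_limb_t np[NN], np_scratch[NN], dp[DN];
  mp_limb_t q_ref[NN - DN + 1], r_ref[DN], q_new[NN - DN];
  mp_limb_t qh;
  gmp_pi1_t dinv;
  int i;

  for (i = 0; i < 1000; i++)
    {
      mpn_random (np, NN);
      mpn_random (dp, DN);
      dp[DN - 1] |= GMP_NUMB_HIGHBIT;   /* normalize the divisor */

      /* Reference result via the public interface.  */
      mpn_tdiv_qr (q_ref, r_ref, 0, np, NN, dp, DN);

      /* The patched function clobbers the numerator, so work on a copy.  */
      MPN_COPY (np_scratch, np, NN);
      invert_pi1 (dinv, dp[DN - 1], dp[DN - 2]);
      qh = mpn_sbpi1_div_qr (q_new, np_scratch, NN, dp, DN, dinv.inv32);

      /* Quotients and remainders must agree limb for limb.  */
      assert (qh == q_ref[NN - DN]);
      assert (mpn_cmp (q_new, q_ref, NN - DN) == 0);
      assert (mpn_cmp (np_scratch, r_ref, DN) == 0);
    }
  printf ("all %d random divisions agree\n", i);
  return 0;
}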


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

