Using mpn_div_qr_2u_pi1?

Sat Aug 18 20:00:31 UTC 2018

The below patch enables use of mpn_div_qr_2u_pi1 (written back in 2011).
It's used only if an assembly implementation is available, which
currently is x86_64 only.

The point of it is to avoid a temporary allocation and shift to
normalize the numerator up front. 

I'm a bit concerned about performance, it's not as well tuned as
mpn_divrem_2, so might not be a win in current form.

On my intel broadwell machine, mpn_divrem_2 takes 16.5 c/l,
mpn_div_qr_2n takes 17 c/l (and it's essentially the same loop, so this
is poor tuning). mpn_div_qr_2u seem to be about 19.5 cycles, and it uses
the shld instruction which likely is a poor choice on many x86 chips. So
2.5 additional cycles for the shifting, which is pretty expensive. For comparison,
mpn_lshift takes only 1.5 c/l, and it's implemented using sse instructions.

So we'd need to save half a cycle in mpn_div_qr_2n to make it same speed
as mpn_divrem_2, and then do the on-the-fly shifting in mpn_div_qr_2u at
a cost of at most one extra cycle.

Long time since I wrote that code. It's not entirely clear to me if it
the loop is limited by instruction issue or the latency of the critical
path, involving both multiply and quite a few additional arithmetic
instructions.

Regards,
/Niels

diff -r 1ad8cc22b714 configure.ac

--- a/configure.ac	Tue Jul 03 11:16:06 2018 +0200
+++ b/configure.ac	Sat Aug 18 20:57:02 2018 +0200
@@ -3526,6 +3526,8 @@
 #undef HAVE_NATIVE_mpn_copyi
 #undef HAVE_NATIVE_mpn_div_qr_1n_pi1
 #undef HAVE_NATIVE_mpn_div_qr_2
+#undef HAVE_NATIVE_mpn_div_qr_2n_pi1
+#undef HAVE_NATIVE_mpn_div_qr_2u_pi1
 #undef HAVE_NATIVE_mpn_divexact_1
 #undef HAVE_NATIVE_mpn_divexact_by3c
 #undef HAVE_NATIVE_mpn_divrem_1
diff -r 1ad8cc22b714 mpn/generic/tdiv_qr.c
--- a/mpn/generic/tdiv_qr.c	Tue Jul 03 11:16:06 2018 +0200
+++ b/mpn/generic/tdiv_qr.c	Sat Aug 18 20:57:02 2018 +0200
@@ -69,7 +69,7 @@
     case 2:
       {
 	mp_ptr n2p;
-	mp_limb_t qhl, cy;
+	mp_limb_t qhl;
 	TMP_DECL;
 	TMP_MARK;
 	if ((dp[1] & GMP_NUMB_HIGHBIT) == 0)
@@ -80,6 +80,16 @@
 	    cnt -= GMP_NAIL_BITS;
 	    d2p[1] = (dp[1] << cnt) | (dp[0] >> (GMP_NUMB_BITS - cnt));
 	    d2p[0] = (dp[0] << cnt) & GMP_NUMB_MASK;
+#if HAVE_NATIVE_mpn_div_qr_2u_pi1
+	    {
+	      gmp_pi1_t dinv;
+	      invert_pi1 (dinv, d2p[1], d2p[0]);
+	      qp[nn-2] = mpn_div_qr_2u_pi1 (qp, rp, np, nn,
+					    d2p[1], d2p[0], cnt, dinv.inv32);
+	    }
+#else
+	    {
+	      mp_limb_t cy;
 	    n2p = TMP_ALLOC_LIMBS (nn + 1);
 	    cy = mpn_lshift (n2p, np, nn, cnt);
 	    n2p[nn] = cy;
@@ -90,6 +100,8 @@
 	      | ((n2p[1] << (GMP_NUMB_BITS - cnt)) & GMP_NUMB_MASK);
 	    rp[1] = (n2p[1] >> cnt);
 	  }
+#endif
+	  }
 	else
 	  {
 	    n2p = TMP_ALLOC_LIMBS (nn);

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.