Using mpn_div_qr_2u_pi1?
Niels Möller
nisse at lysator.liu.se
Sat Aug 18 20:00:31 UTC 2018
The patch below enables use of mpn_div_qr_2u_pi1 (written back in 2011).
It's used only when an assembly implementation is available, which
currently means x86_64 only.
The point of it is to avoid a temporary allocation and shift to
normalize the numerator up front.
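To illustrate what is being avoided, here is a minimal sketch of the
up-front normalization pattern, reduced to a single-limb divisor for
simplicity. The function name div_2by1_normalizing and the use of the
compiler's unsigned __int128 division in place of a real division loop
are my illustrative stand-ins, not code from GMP or from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: divide u1*B + u0 (B = 2^64, u1 < d) by an unnormalized d by
   shifting both operands up front, then shifting the remainder back
   down.  The 128-by-64-bit division stands in for the normalized
   division loop. */
static uint64_t
div_2by1_normalizing (uint64_t *r, uint64_t u1, uint64_t u0, uint64_t d)
{
  int cnt = __builtin_clzll (d);          /* d != 0 */
  uint64_t dn = d << cnt;                 /* normalized: high bit set */
  unsigned __int128 u = ((unsigned __int128) u1 << 64) | u0;
  unsigned __int128 un = u << cnt;        /* pre-shifted numerator copy */
  uint64_t q = (uint64_t) (un / dn);
  *r = ((uint64_t) (un % dn)) >> cnt;     /* shift remainder back down */
  return q;
}
```

mpn_div_qr_2u_pi1 instead folds the shift by cnt into the division loop
itself, so the shifted copy of the numerator (n2p in tdiv_qr.c) is never
materialized.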
I'm a bit concerned about performance: it's not as well tuned as
mpn_divrem_2, so it might not be a win in its current form.
On my Intel Broadwell machine, mpn_divrem_2 takes 16.5 c/l, while
mpn_div_qr_2n takes 17 c/l (and it's essentially the same loop, so this
is poor tuning). mpn_div_qr_2u seems to be about 19.5 c/l, and it uses
the shld instruction, which is likely a poor choice on many x86 chips.
So the shifting costs 2.5 additional cycles, which is pretty expensive;
for comparison, mpn_lshift takes only 1.5 c/l, and it's implemented
using sse instructions. So we'd need to save half a cycle in
mpn_div_qr_2n to make it the same speed as mpn_divrem_2, and then do the
on-the-fly shifting in mpn_div_qr_2u at a cost of at most one extra
cycle.
It's been a long time since I wrote that code. It's not entirely clear
to me whether the loop is limited by instruction issue or by the latency
of the critical path, which involves the multiply and quite a few
additional arithmetic instructions.
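For context on the dinv.inv32 value the patch precomputes with
invert_pi1: it is the "pi1" reciprocal of the normalized divisor, in the
style of Möller–Granlund, "Improved division by invariant integers"
(IEEE Trans. Computers, 2011). The patch uses the two-limb (3/2)
version; the sketch below shows the simpler single-limb (2/1) variant of
the same idea, with my own names reciprocal_word and div_2by1_pi1 — it
is not GMP's actual invert_pi1 or udiv_qr_3by2 code:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Reciprocal v = floor((B^2 - 1) / d) - B for normalized d (high bit
   set); the truncating cast to uint64_t subtracts the implicit B. */
static uint64_t
reciprocal_word (uint64_t d)
{
  return (uint64_t) (~(u128) 0 / d);
}

/* Divide u1*B + u0 by normalized d, with u1 < d, using the precomputed
   reciprocal v: one wide multiply plus adjustments, and no division
   instruction in the hot path. */
static uint64_t
div_2by1_pi1 (uint64_t *r, uint64_t u1, uint64_t u0, uint64_t d,
              uint64_t v)
{
  u128 q = (u128) v * u1 + (((u128) u1 << 64) | u0);
  uint64_t q1 = (uint64_t) (q >> 64) + 1;
  uint64_t q0 = (uint64_t) q;
  uint64_t rem = u0 - q1 * d;   /* computed mod B */
  if (rem > q0)                 /* candidate quotient one too large */
    {
      q1--;
      rem += d;
    }
  if (rem >= d)                 /* unlikely final adjustment */
    {
      q1++;
      rem -= d;
    }
  *r = rem;
  return q1;
}
```

The critical path the previous paragraph worries about is visible here:
the multiply v * u1 feeds the adjustment arithmetic, which feeds the
remainder that the next iteration of a limb loop would depend on.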
Regards,
/Niels
diff -r 1ad8cc22b714 configure.ac
--- a/configure.ac Tue Jul 03 11:16:06 2018 +0200
+++ b/configure.ac Sat Aug 18 20:57:02 2018 +0200
@@ -3526,6 +3526,8 @@
#undef HAVE_NATIVE_mpn_copyi
#undef HAVE_NATIVE_mpn_div_qr_1n_pi1
#undef HAVE_NATIVE_mpn_div_qr_2
+#undef HAVE_NATIVE_mpn_div_qr_2n_pi1
+#undef HAVE_NATIVE_mpn_div_qr_2u_pi1
#undef HAVE_NATIVE_mpn_divexact_1
#undef HAVE_NATIVE_mpn_divexact_by3c
#undef HAVE_NATIVE_mpn_divrem_1
diff -r 1ad8cc22b714 mpn/generic/tdiv_qr.c
--- a/mpn/generic/tdiv_qr.c Tue Jul 03 11:16:06 2018 +0200
+++ b/mpn/generic/tdiv_qr.c Sat Aug 18 20:57:02 2018 +0200
@@ -69,7 +69,7 @@
case 2:
{
mp_ptr n2p;
- mp_limb_t qhl, cy;
+ mp_limb_t qhl;
TMP_DECL;
TMP_MARK;
if ((dp[1] & GMP_NUMB_HIGHBIT) == 0)
@@ -80,6 +80,16 @@
cnt -= GMP_NAIL_BITS;
d2p[1] = (dp[1] << cnt) | (dp[0] >> (GMP_NUMB_BITS - cnt));
d2p[0] = (dp[0] << cnt) & GMP_NUMB_MASK;
+#if HAVE_NATIVE_mpn_div_qr_2u_pi1
+ {
+ gmp_pi1_t dinv;
+ invert_pi1 (dinv, d2p[1], d2p[0]);
+ qp[nn-2] = mpn_div_qr_2u_pi1 (qp, rp, np, nn,
+ d2p[1], d2p[0], cnt, dinv.inv32);
+ }
+#else
+ {
+ mp_limb_t cy;
n2p = TMP_ALLOC_LIMBS (nn + 1);
cy = mpn_lshift (n2p, np, nn, cnt);
n2p[nn] = cy;
@@ -90,6 +100,8 @@
| ((n2p[1] << (GMP_NUMB_BITS - cnt)) & GMP_NUMB_MASK);
rp[1] = (n2p[1] >> cnt);
}
+#endif
+ }
else
{
n2p = TMP_ALLOC_LIMBS (nn);
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.