mpn_sqrtrem{1,2} - patch for pure C implem
Adrien Prost-Boucle
adrien.prost-boucle at laposte.net
Tue Mar 28 19:18:05 UTC 2017
Hi,
On Sat, 2017-03-25 at 15:38 +0100, Marco Bodrato wrote:
> Il Gio, 23 Marzo 2017 8:46 pm, Adrien Prost-Boucle ha scritto:
> > > About the pure C code, integer version that was working on,
> > > But... when I put that code in GMP code, that resulted in
> > > a noticeable slowdown /o\
> > Problem solved.
> > Branch prediction made GMP's sqrtrem1 appear faster than it actually
> > is on normal pseudo-random workloads.
> > So, I have a working and exhaustively tested C version for sqrtrem1,
> > that is slightly faster than GMP's.
> > Patch coming soon.
>
> I'll be happy to examine it.
The diff is at the end of this message. Based on rev 17327.
It's a bit ugly with interleaved ++ and -- ...
Basically, remove previous mpn_sqrtrem1 and replace by my 2 versions, 32 and 64b.
> Your observations ask for some investigation... Is your version faster
> because of a faster core-sequence or thanks to an improved handling of the
> possible branches? Can the proposed branch structure be applied also to
> the current code, or it's strictly linked to the new core?
The only branch is for the final correction at end of mpn_sqrtrem1.
I tried with the previous mpn_sqrtrem1 version, which has a condition,
and with an unconditional code that needs 2 multiplications.
My version with unconditional correction had same speed.
My version with current mpn_sqrtrem1 correction is slightly faster.
So it's the "core" that is faster.
I think it's far from perfect.
Look at variable "accu" in 32b and 64b versions, I needed different signedness.
The direction of the correction is different too.
But this might be normal, linked 64b version having one more iteration compared to 32b...
> We shall probably move to SQRTREM1_NEEDNORM=0 (notation from Adrien's
> patch) as soon as sqrtrem2 do not need rem1 any more, to avoid, at least,
> the double computation of the reminder for non-normalised inputs.
It seems efficient pure C implementations need the table of invsqrt,
and consequently needs normalized input.
Of course for machine-specific code, NEEDNORM=0 has advantages.
Adrien
diff -r 6393a907d1b5 mpn/generic/sqrtrem.c
--- a/mpn/generic/sqrtrem.c Wed Mar 08 19:57:44 2017 +0100
+++ b/mpn/generic/sqrtrem.c Tue Mar 28 21:00:15 2017 +0200
@@ -49,121 +49,144 @@
#define USE_DIVAPPR_Q 1
#define TRACE(x)
-static const unsigned char invsqrttab[384] = /* The common 0x100 was removed */
-{
- 0xff,0xfd,0xfb,0xf9,0xf7,0xf5,0xf3,0xf2, /* sqrt(1/80)..sqrt(1/87) */
- 0xf0,0xee,0xec,0xea,0xe9,0xe7,0xe5,0xe4, /* sqrt(1/88)..sqrt(1/8f) */
- 0xe2,0xe0,0xdf,0xdd,0xdb,0xda,0xd8,0xd7, /* sqrt(1/90)..sqrt(1/97) */
- 0xd5,0xd4,0xd2,0xd1,0xcf,0xce,0xcc,0xcb, /* sqrt(1/98)..sqrt(1/9f) */
- 0xc9,0xc8,0xc6,0xc5,0xc4,0xc2,0xc1,0xc0, /* sqrt(1/a0)..sqrt(1/a7) */
- 0xbe,0xbd,0xbc,0xba,0xb9,0xb8,0xb7,0xb5, /* sqrt(1/a8)..sqrt(1/af) */
- 0xb4,0xb3,0xb2,0xb0,0xaf,0xae,0xad,0xac, /* sqrt(1/b0)..sqrt(1/b7) */
- 0xaa,0xa9,0xa8,0xa7,0xa6,0xa5,0xa4,0xa3, /* sqrt(1/b8)..sqrt(1/bf) */
- 0xa2,0xa0,0x9f,0x9e,0x9d,0x9c,0x9b,0x9a, /* sqrt(1/c0)..sqrt(1/c7) */
- 0x99,0x98,0x97,0x96,0x95,0x94,0x93,0x92, /* sqrt(1/c8)..sqrt(1/cf) */
- 0x91,0x90,0x8f,0x8e,0x8d,0x8c,0x8c,0x8b, /* sqrt(1/d0)..sqrt(1/d7) */
- 0x8a,0x89,0x88,0x87,0x86,0x85,0x84,0x83, /* sqrt(1/d8)..sqrt(1/df) */
- 0x83,0x82,0x81,0x80,0x7f,0x7e,0x7e,0x7d, /* sqrt(1/e0)..sqrt(1/e7) */
- 0x7c,0x7b,0x7a,0x79,0x79,0x78,0x77,0x76, /* sqrt(1/e8)..sqrt(1/ef) */
- 0x76,0x75,0x74,0x73,0x72,0x72,0x71,0x70, /* sqrt(1/f0)..sqrt(1/f7) */
- 0x6f,0x6f,0x6e,0x6d,0x6d,0x6c,0x6b,0x6a, /* sqrt(1/f8)..sqrt(1/ff) */
- 0x6a,0x69,0x68,0x68,0x67,0x66,0x66,0x65, /* sqrt(1/100)..sqrt(1/107) */
- 0x64,0x64,0x63,0x62,0x62,0x61,0x60,0x60, /* sqrt(1/108)..sqrt(1/10f) */
- 0x5f,0x5e,0x5e,0x5d,0x5c,0x5c,0x5b,0x5a, /* sqrt(1/110)..sqrt(1/117) */
- 0x5a,0x59,0x59,0x58,0x57,0x57,0x56,0x56, /* sqrt(1/118)..sqrt(1/11f) */
- 0x55,0x54,0x54,0x53,0x53,0x52,0x52,0x51, /* sqrt(1/120)..sqrt(1/127) */
- 0x50,0x50,0x4f,0x4f,0x4e,0x4e,0x4d,0x4d, /* sqrt(1/128)..sqrt(1/12f) */
- 0x4c,0x4b,0x4b,0x4a,0x4a,0x49,0x49,0x48, /* sqrt(1/130)..sqrt(1/137) */
- 0x48,0x47,0x47,0x46,0x46,0x45,0x45,0x44, /* sqrt(1/138)..sqrt(1/13f) */
- 0x44,0x43,0x43,0x42,0x42,0x41,0x41,0x40, /* sqrt(1/140)..sqrt(1/147) */
- 0x40,0x3f,0x3f,0x3e,0x3e,0x3d,0x3d,0x3c, /* sqrt(1/148)..sqrt(1/14f) */
- 0x3c,0x3b,0x3b,0x3a,0x3a,0x39,0x39,0x39, /* sqrt(1/150)..sqrt(1/157) */
- 0x38,0x38,0x37,0x37,0x36,0x36,0x35,0x35, /* sqrt(1/158)..sqrt(1/15f) */
- 0x35,0x34,0x34,0x33,0x33,0x32,0x32,0x32, /* sqrt(1/160)..sqrt(1/167) */
- 0x31,0x31,0x30,0x30,0x2f,0x2f,0x2f,0x2e, /* sqrt(1/168)..sqrt(1/16f) */
- 0x2e,0x2d,0x2d,0x2d,0x2c,0x2c,0x2b,0x2b, /* sqrt(1/170)..sqrt(1/177) */
- 0x2b,0x2a,0x2a,0x29,0x29,0x29,0x28,0x28, /* sqrt(1/178)..sqrt(1/17f) */
- 0x27,0x27,0x27,0x26,0x26,0x26,0x25,0x25, /* sqrt(1/180)..sqrt(1/187) */
- 0x24,0x24,0x24,0x23,0x23,0x23,0x22,0x22, /* sqrt(1/188)..sqrt(1/18f) */
- 0x21,0x21,0x21,0x20,0x20,0x20,0x1f,0x1f, /* sqrt(1/190)..sqrt(1/197) */
- 0x1f,0x1e,0x1e,0x1e,0x1d,0x1d,0x1d,0x1c, /* sqrt(1/198)..sqrt(1/19f) */
- 0x1c,0x1b,0x1b,0x1b,0x1a,0x1a,0x1a,0x19, /* sqrt(1/1a0)..sqrt(1/1a7) */
- 0x19,0x19,0x18,0x18,0x18,0x18,0x17,0x17, /* sqrt(1/1a8)..sqrt(1/1af) */
- 0x17,0x16,0x16,0x16,0x15,0x15,0x15,0x14, /* sqrt(1/1b0)..sqrt(1/1b7) */
- 0x14,0x14,0x13,0x13,0x13,0x12,0x12,0x12, /* sqrt(1/1b8)..sqrt(1/1bf) */
- 0x12,0x11,0x11,0x11,0x10,0x10,0x10,0x0f, /* sqrt(1/1c0)..sqrt(1/1c7) */
- 0x0f,0x0f,0x0f,0x0e,0x0e,0x0e,0x0d,0x0d, /* sqrt(1/1c8)..sqrt(1/1cf) */
- 0x0d,0x0c,0x0c,0x0c,0x0c,0x0b,0x0b,0x0b, /* sqrt(1/1d0)..sqrt(1/1d7) */
- 0x0a,0x0a,0x0a,0x0a,0x09,0x09,0x09,0x09, /* sqrt(1/1d8)..sqrt(1/1df) */
- 0x08,0x08,0x08,0x07,0x07,0x07,0x07,0x06, /* sqrt(1/1e0)..sqrt(1/1e7) */
- 0x06,0x06,0x06,0x05,0x05,0x05,0x04,0x04, /* sqrt(1/1e8)..sqrt(1/1ef) */
- 0x04,0x04,0x03,0x03,0x03,0x03,0x02,0x02, /* sqrt(1/1f0)..sqrt(1/1f7) */
- 0x02,0x02,0x01,0x01,0x01,0x01,0x00,0x00 /* sqrt(1/1f8)..sqrt(1/1ff) */
+/* Table of pre-calculated inverse square root 1/sqrt(a)
+ Note: x must be normalized, i.e. the 2 MSBs are not both zero
+ Access the table by taking the 9 MSBs of the input value, minus 128
+ Then add 256 to the table value
+ For 32-bit values: values in the table are shifted 24 bits (multiplied by 2^24)
+ For 64-bit values: values in the table are shifted 40 bits (multiplied by 2^40) */
+static const unsigned char invsqrttab[384] = {
+ 255, 253, 251, 249, 247, 245, 243, 242, 240, 238, 236, 234, 233, 231, 229, 227,
+ 226, 224, 222, 221, 219, 218, 216, 215, 213, 211, 210, 208, 207, 205, 204, 203,
+ 201, 200, 198, 197, 195, 194, 193, 191, 190, 189, 187, 186, 185, 184, 182, 181,
+ 180, 179, 177, 176, 175, 174, 173, 171, 170, 169, 168, 167, 166, 165, 164, 162,
+ 161, 160, 159, 158, 157, 156, 155, 154, 153, 152, 151, 150, 149, 148, 147, 146,
+ 145, 144, 143, 142, 141, 140, 139, 138, 138, 137, 136, 135, 134, 133, 132, 131,
+ 130, 130, 129, 128, 127, 126, 125, 125, 124, 123, 122, 121, 121, 120, 119, 118,
+ 117, 117, 116, 115, 114, 114, 113, 112, 111, 111, 110, 109, 108, 108, 107, 106,
+ 106, 105, 104, 103, 103, 102, 101, 101, 100, 99, 99, 98, 97, 97, 96, 95,
+ 95, 94, 94, 93, 92, 92, 91, 90, 90, 89, 89, 88, 87, 87, 86, 86,
+ 85, 84, 84, 83, 83, 82, 81, 81, 80, 80, 79, 79, 78, 78, 77, 76,
+ 76, 75, 75, 74, 74, 73, 73, 72, 72, 71, 71, 70, 70, 69, 69, 68,
+ 67, 67, 66, 66, 65, 65, 65, 64, 64, 63, 63, 62, 62, 61, 61, 60,
+ 60, 59, 59, 58, 58, 57, 57, 56, 56, 56, 55, 55, 54, 54, 53, 53,
+ 52, 52, 52, 51, 51, 50, 50, 49, 49, 49, 48, 48, 47, 47, 47, 46,
+ 46, 45, 45, 45, 44, 44, 43, 43, 43, 42, 42, 41, 41, 41, 40, 40,
+ 39, 39, 39, 38, 38, 37, 37, 37, 36, 36, 36, 35, 35, 35, 34, 34,
+ 33, 33, 33, 32, 32, 32, 31, 31, 31, 30, 30, 30, 29, 29, 29, 28,
+ 28, 27, 27, 27, 26, 26, 26, 25, 25, 25, 24, 24, 24, 24, 23, 23,
+ 23, 22, 22, 22, 21, 21, 21, 20, 20, 20, 19, 19, 19, 18, 18, 18,
+ 18, 17, 17, 17, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 13, 13,
+ 13, 13, 12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 9, 9, 9, 9,
+ 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 5, 4,
+ 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0,
};
-/* Compute s = floor(sqrt(a0)), and *rp = a0 - s^2. */
+/* mpn_sqrtrem1: Compute s = floor(sqrt(a0)), and *rp = a0 - s^2 */
-#if GMP_NUMB_BITS > 32
-#define MAGIC CNST_LIMB(0x10000000000) /* 0xffe7debbfc < MAGIC < 0x232b1850f410 */
-#else
-#define MAGIC CNST_LIMB(0x100000) /* 0xfee6f < MAGIC < 0x29cbc8 */
-#endif
+#if GMP_NUMB_BITS <= 32
+/* 32-bit code */
static mp_limb_t
mpn_sqrtrem1 (mp_ptr rp, mp_limb_t a0)
{
-#if GMP_NUMB_BITS > 32
- mp_limb_t a1;
-#endif
- mp_limb_t x0, t2, t, x2;
- unsigned abits;
+ mp_limb_t root, invroot;
+ mp_limb_signed_t accu;
ASSERT_ALWAYS (GMP_NAIL_BITS == 0);
- ASSERT_ALWAYS (GMP_LIMB_BITS == 32 || GMP_LIMB_BITS == 64);
+ ASSERT_ALWAYS (GMP_LIMB_BITS == 32);
ASSERT (a0 >= GMP_NUMB_HIGHBIT / 2);
- /* Use Newton iterations for approximating 1/sqrt(a) instead of sqrt(a),
- since we can do the former without division. As part of the last
- iteration convert from 1/sqrt(a) to sqrt(a). */
+ /* Get 9 bits of the inverse square root */
+ invroot = ((mp_limb_t)invsqrttab[(a0 >> 23) - 128]) + 256;
- abits = a0 >> (GMP_LIMB_BITS - 1 - 8); /* extract bits for table lookup */
- x0 = 0x100 | invsqrttab[abits - 0x80]; /* initial 1/sqrt(a) */
+ /* Here we have an approximation of 1/sqrt(x), shifted by 24 bits
+ Exactly 9 bits are used */
- /* x0 is now an 8 bits approximation of 1/sqrt(a0) */
+ /* Get the root: it is shifted by 24 - 9 = 15 */
+ root = (vsh >> 9) * invroot;
+ root = root >> 15;
+ /* Here we have the root, not shifted */
-#if GMP_NUMB_BITS > 32
- a1 = a0 >> (GMP_LIMB_BITS - 1 - 32);
- t = (mp_limb_signed_t) (CNST_LIMB(0x2000000000000) - 0x30000 - a1 * x0 * x0) >> 16;
- x0 = (x0 << 16) + ((mp_limb_signed_t) (x0 * t) >> (16+2));
-
- /* x0 is now a 16 bits approximation of 1/sqrt(a0) */
+ /* Perform one Newton iteration for the square root:
+ r = r + (a - r^2) / 2r */
- t2 = x0 * (a0 >> (32-8));
- t = t2 >> 25;
- t = ((mp_limb_signed_t) ((a0 << 14) - t * t - MAGIC) >> (32-8));
- x0 = t2 + ((mp_limb_signed_t) (x0 * t) >> 15);
- x0 >>= 32;
-#else
- t2 = x0 * (a0 >> (16-8));
- t = t2 >> 13;
- t = ((mp_limb_signed_t) ((a0 << 6) - t * t - MAGIC) >> (16-8));
- x0 = t2 + ((mp_limb_signed_t) (x0 * t) >> 7);
- x0 >>= 16;
-#endif
+ accu = a0 - root * root;
+ accu = (accu >> 3) * invroot;
+ /* Here we have (a - r^2) / r, shifted by 24 - 3 = 21
+ Add a bit of rounding to ease correction */
+ root += ((accu + (1 << 21)) >> 21) / 2;
- /* x0 is now a full limb approximation of sqrt(a0) */
+ /* Correction: subtract 1 to the root if needed */
+ accu = a0 - root * root;
+ if (accu - a0 > 0) {
+ accu -= 2*root - 1;
+ root--;
+ }
- x2 = x0 * x0;
- if (x2 + 2*x0 <= a0 - 1)
- {
- x2 += 2*x0 + 1;
- x0++;
- }
-
- *rp = a0 - x2;
- return x0;
+ *rp = a0 - accu;
+ return root;
}
+#else
+/* 64-bit code */
+
+static mp_limb_t
+mpn_sqrtrem1 (mp_ptr rp, mp_limb_t a0)
+{
+ mp_limb_t root, invroot;
+ mp_limb_t accu;
+
+ ASSERT_ALWAYS (GMP_NAIL_BITS == 0);
+ ASSERT_ALWAYS (GMP_LIMB_BITS == 64);
+ ASSERT (a0 >= GMP_NUMB_HIGHBIT / 2);
+
+ // Get 9 bits of the inverse square root
+ invroot = ((mp_limb_t)invsqrttab[(a0 >> 55) - 128]) + 256;
+
+ /* Here we have an approximation of 1/sqrt(x), shifted by 40 bits
+ Exactly 9 bits are used */
+
+ /* Perform one Newton iteration for the inverse square root:
+ y = y * (3 - a * y^2) / 2*/
+
+ /* First, do a * y^2, using as many significant bits as possible */
+ accu = (a0 >> 18) * invroot * invroot;
+ /* Here we have a * y^2 shifted by 2 * 40 - 18 = 62
+ Subtract that to the value 3, shifted accordingly */
+ accu = ((mp_limb_t)3 << 62) - accu;
+ /* Now the multiplication by y / 2: the division by 2 gets part of the shift */
+ invroot = invroot * (accu >> 9);
+
+ /* Here the inverse square root is shifted by 40 + 62 - 9 + 1 = 94 bits
+ 63 or 64 bits are used */
+
+ // Get the root
+ root = (a0 >> 32) * (invroot >> 32);
+ // Here we have the root, shifted by 94 - 64 = 30
+ root = root >> 30;
+ /* Here we have the root, not shifted */
+
+ /* Perform one Newton iteration for the square root:
+ r = r + (a - r^2) / 2r */
+
+ accu = a0 - root * root;
+ accu = (accu >> 20) * (invroot >> 27);
+ // Here we have (a - r^2) / r, shifted by 94 - 47 = 47
+ root += (accu >> 47) / 2;
+
+ /* Correction: add 1 to the root if needed */
+ accu = root * root;
+ if (accu + 2*root <= a0 - 1) {
+ accu += 2*root + 1;
+ root++;
+ }
+
+ *rp = a0 - accu;
+ return root;
+}
+
+#endif
#define Prec (GMP_NUMB_BITS >> 1)
#if ! defined(SQRTREM2_INPLACE)
More information about the gmp-devel
mailing list