From rth at twiddle.net Wed Feb 1 13:25:43 2012
From: rth at twiddle.net (Richard Henderson)
Date: Wed, 01 Feb 2012 23:25:43 +1100
Subject: Some arm cortex-a8 improvements
Message-ID: <4F292F47.6020306@twiddle.net>

Three patches herein. If there's a better way to submit patches,
please advise; I've never used hg before.

The first patch gives gcc control over ctz/clz, particularly for
armv6t2 and later, which have rbit for use in ctz.

The second patch improves multiplication a bit. I'm still playing
with addmul_2, but this is a start for addmul_1/mul_1. I couldn't
do better than the existing submul_1. Unfortunately the Xscale
machines in the gcc build farm are turned off, so I can't test to
see if I've regressed on that platform.

The third patch tidies up add_n/sub_n, and provides for the carry-in
entry points.

It's a bit touchy speed testing these. There's no cycle counter
available in userspace, and Hz is depressingly low. So I've had to
bump the minimum iterations way, way up in order to get semi-reliable
results, which causes the speed testing to take quite a long time.

Feedback welcome.

r~
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: zz
URL:

From tg at gmplib.org Fri Feb 10 14:19:05 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 10 Feb 2012 14:19:05 +0100
Subject: Mainline repo
Message-ID: <867gzuhlme.fsf@shell.gmplib.org>

I am moving the nightly testing back from the gmp-5.0 repo to the
mainline gmp repo. I'll keep the extra barbwire switched on for now;
at a later point I will have the nightly builds create two library
builds, one with extra checks and one without. The latter will be
used for tuneup.

We made several bug fixes to gmp-5.0 that are not yet in the mainline
gmp. Please remember to move your fixes over.

(It is probably best to copy the change log entry into the same
position, so that discrepancies therein better reflect actual source
code differences. That helps me a lot when making dot releases.)

--
Torbjörn

From nisse at lysator.liu.se Fri Feb 10 22:19:52 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 10 Feb 2012 22:19:52 +0100
Subject: Mainline repo
In-Reply-To: <867gzuhlme.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Fri, 10 Feb 2012 14:19:05 +0100")
References: <867gzuhlme.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> We made several bug fixes to gmp-5.0 that are not yet in the mainline
> gmp. Please remember to move your fixes over.

Moved my changes over now. Summary, by commit id in the gmp-5.0 repo:

13549:5bed10c29692
  Assert fix (u0 == u1 case), mpn_gcdext. Fix copied.

13548:77785806d3f1
  hgcd_matrix_update_q bug. Fix copied, and code slightly simplified.

13547:ec2c2959dc8c
  t-gcd and t-hgcd test cases. Improvements copied.

13545:11a901ce5242
  gcdext_subdiv_step normalization fix. Current mpn_gcdext_hook seems
  to be correct.

13544:eab9e2a8bf48
  gcdext_subdiv_step carry. Uses different code in
  gcdext_lehmer.c:mpn_gcdext_hook. Related fix to u1n < un case.

> (It is probably best to copy the change log entry into the same
> position, so that discrepancies therein better reflect actual source
> code differences.

I'm not sure what order you'd prefer. For three of the above changes,
I could copy the ChangeLog entries almost verbatim: I just set the
date to today's date, I added a note that they originated in the 5.0
repo, and in one case I had to change the file name since
hgcd_matrix_update_q was moved from hgcd.c to hgcd_matrix.c. The other
bugfixes had to be handled differently in the main repo.

Would you prefer to have the copied entries sorted by the original
date?

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Fri Feb 10 22:22:15 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 10 Feb 2012 22:22:15 +0100
Subject: Mainline repo
In-Reply-To: (Niels Möller's message of "Fri, 10 Feb 2012 22:19:52 +0100")
References: <867gzuhlme.fsf@shell.gmplib.org>
Message-ID: <86ty2yl6yg.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Would you prefer to have the copied entries sorted by the original
  date?

Yes, please.

(We usually also add just one dated entry per hacker and day. That
keeps the ChangeLog file size down.)

--
Torbjörn

From Paul.Zimmermann at loria.fr Sun Feb 12 18:02:25 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Sun, 12 Feb 2012 18:02:25 +0100
Subject: fixed-size mpn_mul_n for small n?
Message-ID:

Hi,

GMP currently has variable-size assembly code for mpn_mul_n on some
processors. Could it be faster to have fixed-size assembly code for
small values of n (say up to n=32)? Then mpn_mul_n() would simply be a
wrapper to those fixed-size functions, or to a variable-size code for
n>32.

Paul Zimmermann

From tg at gmplib.org Sun Feb 12 18:10:01 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sun, 12 Feb 2012 18:10:01 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: (Zimmermann Paul's message of "Sun, 12 Feb 2012 18:02:25 +0100")
References:
Message-ID: <861uq0c712.fsf@shell.gmplib.org>

Zimmermann Paul writes:

  GMP currently has variable-size assembly code for mpn_mul_n on some
  processors. Could it be faster to have fixed-size assembly code for
  small values of n (say up to n=32)? Then mpn_mul_n() would simply be
  a wrapper to those fixed-size functions, or to a variable-size code
  for n>32.

There are assembly mpn_mul_basecase implementations for a lot of
machines. Some of these offer special code for certain small un,vn.
This gives two benefits: (1) less overhead for small sizes and (2)
simpler general code.

Providing special code for many un,vn combinations (as separate
functions or as part of mpn_mul_basecase) quickly becomes
unmanageable. If we want to handle all sizes <= 16 (say) we'll need
136 variants.

I don't think it makes much sense providing code for just un=vn
(except that it becomes more manageable...).

--
Torbjörn

From Paul.Zimmermann at loria.fr Sun Feb 12 18:44:25 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Sun, 12 Feb 2012 18:44:25 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <861uq0c712.fsf@shell.gmplib.org> (message from Torbjorn Granlund on Sun, 12 Feb 2012 18:10:01 +0100)
References: <861uq0c712.fsf@shell.gmplib.org>
Message-ID:

Torbjörn,

> Providing special code for many un,vn combinations (as separate
> functions or as part of mpn_mul_basecase) quickly becomes
> unmanageable. If we want to handle all sizes <= 16 (say) we'll need
> 136 variants.
>
> I don't think it makes much sense providing code for just un=vn
> (except that it becomes more manageable...).

I was thinking of applications doing heavy computations in modular
arithmetic, like GMP-ECM, where we only need the case un=vn.

I am trying to optimize the modular multiplications and squarings in
GMP-ECM (where we use Montgomery's reduction). Below are some timings
of low-level functions for different limb sizes up to n=20 on an AMD
Phenom(tm) II X2 B55 running at 3 GHz.

Paul

******************
Time in microseconds per call, size=1
mpn_mul_n = 0.009601
mpn_sqr = 0.007200
mpn_redc_1 = 0.010401
mpn_redc_2 = 0.013201
mulredc = 0.004800
mul+redc_1 = 0.019601
mul+redc_2 = 0.022801
mul+redc3 = 0.017201
sqr+redc_1 = 0.018401
sqr+redc_2 = 0.020401
sqr+redc3 = 0.014801
mulredc1 = 0.004800
******************
Time in microseconds per call, size=2
mpn_mul_n = 0.013601
mpn_sqr = 0.009601
mpn_redc_1 = 0.017601
mpn_redc_2 = 0.020001
mulredc = 0.018401
mul+redc_1 = 0.031202
mul+redc_2 = 0.032802
mul+redc3 = 0.032002
sqr+redc_1 = 0.031202
sqr+redc_2 = 0.032802
sqr+redc3 = 0.028002
mulredc1 = 0.007600
******************
Time in microseconds per call, size=3
mpn_mul_n = 0.018001
mpn_sqr = 0.013201
mpn_redc_1 = 0.026402
mpn_redc_2 = 0.031202
mulredc = 0.028802
mul+redc_1 = 0.046803
mul+redc_2 = 0.050403
mul+redc3 = 0.048003
sqr+redc_1 = 0.039602
sqr+redc_2 = 0.045603
sqr+redc3 = 0.040802
mulredc1 = 0.010001
******************
Time in microseconds per call, size=4
mpn_mul_n = 0.024002
mpn_sqr = 0.017601
mpn_redc_1 = 0.038402
mpn_redc_2 = 0.036802
mulredc = 0.046403
mul+redc_1 = 0.062404
mul+redc_2 = 0.062404
mul+redc3 = 0.064004
sqr+redc_1 = 0.056004
sqr+redc_2 = 0.056004
sqr+redc3 = 0.056004
mulredc1 = 0.011601
******************
Time in microseconds per call, size=5
mpn_mul_n = 0.034002
mpn_sqr = 0.028002
mpn_redc_1 = 0.050003
mpn_redc_2 = 0.056003
mulredc = 0.066004
mul+redc_1 = 0.084005
mul+redc_2 = 0.088005
mul+redc3 = 0.092005
sqr+redc_1 = 0.078005
sqr+redc_2 = 0.082005
sqr+redc3 = 0.084005
mulredc1 = 0.013601
******************
Time in microseconds per call, size=6
mpn_mul_n = 0.040802
mpn_sqr = 0.036002
mpn_redc_1 = 0.064804
mpn_redc_2 = 0.064804
mulredc = 0.091205
mul+redc_1 = 0.108007
mul+redc_2 = 0.108007
mul+redc3 = 0.117607
sqr+redc_1 = 0.100806
sqr+redc_2 = 0.098407
sqr+redc3 = 0.110407
mulredc1 = 0.014801
******************
Time in microseconds per call, size=7
mpn_mul_n = 0.056004
mpn_sqr = 0.042003
mpn_redc_1 = 0.089606
mpn_redc_2 = 0.086806
mulredc = 0.120408
mul+redc_1 = 0.145609
mul+redc_2 = 0.142808
mul+redc3 = 0.148410
sqr+redc_1 = 0.128808
sqr+redc_2 = 0.128808
sqr+redc3 = 0.131608
mulredc1 = 0.016801
******************
Time in microseconds per call, size=8
mpn_mul_n = 0.067204
mpn_sqr = 0.051203
mpn_redc_1 = 0.102406
mpn_redc_2 = 0.102406
mulredc = 0.153610
mul+redc_1 = 0.169610
mul+redc_2 = 0.163210
mul+redc3 = 0.179211
sqr+redc_1 = 0.153610
sqr+redc_2 = 0.147210
sqr+redc3 = 0.163210
mulredc1 = 0.018401
******************
Time in microseconds per call, size=9
mpn_mul_n = 0.090005
mpn_sqr = 0.061204
mpn_redc_1 = 0.126008
mpn_redc_2 = 0.122408
mulredc = 0.190812
mul+redc_1 = 0.212414
mul+redc_2 = 0.219614
mul+redc3 = 0.219614
sqr+redc_1 = 0.183612
sqr+redc_2 = 0.187212
sqr+redc3 = 0.194412
mulredc1 = 0.020001
******************
Time in microseconds per call, size=10
mpn_mul_n = 0.100006
mpn_sqr = 0.072005
mpn_redc_1 = 0.140008
mpn_redc_2 = 0.144009
mulredc = 0.232015
mul+redc_1 = 0.240015
mul+redc_2 = 0.248015
mul+redc3 = 0.264017
sqr+redc_1 = 0.216013
sqr+redc_2 = 0.216014
sqr+redc3 = 0.232014
mulredc1 = 0.022001
******************
Time in microseconds per call, size=11
mpn_mul_n = 0.123208
mpn_sqr = 0.079206
mpn_redc_1 = 0.162810
mpn_redc_2 = 0.167210
mulredc = 0.277218
mul+redc_1 = 0.281618
mul+redc_2 = 0.303619
mul+redc3 = 0.303620
sqr+redc_1 = 0.246416
sqr+redc_2 = 0.246416
sqr+redc3 = 0.264017
mulredc1 = 0.023601
******************
Time in microseconds per call, size=12
mpn_mul_n = 0.139210
mpn_sqr = 0.096006
mpn_redc_1 = 0.192012
mpn_redc_2 = 0.182411
mulredc = 0.326421
mul+redc_1 = 0.331221
mul+redc_2 = 0.321621
mul+redc3 = 0.355223
sqr+redc_1 = 0.283217
sqr+redc_2 = 0.283218
sqr+redc3 = 0.316821
mulredc1 = 0.024802
******************
Time in microseconds per call, size=13
mpn_mul_n = 0.171611
mpn_sqr = 0.109208
mpn_redc_1 = 0.213213
mpn_redc_2 = 0.218413
mulredc = 0.379625
mul+redc_1 = 0.384824
mul+redc_2 = 0.390025
mul+redc3 = 0.421226
sqr+redc_1 = 0.327621
sqr+redc_2 = 0.327621
sqr+redc3 = 0.353622
mulredc1 = 0.026802
******************
Time in microseconds per call, size=14
mpn_mul_n = 0.184813
mpn_sqr = 0.123207
mpn_redc_1 = 0.229614
mpn_redc_2 = 0.229616
mulredc = 0.431227
mul+redc_1 = 0.414426
mul+redc_2 = 0.420027
mul+redc3 = 0.464830
sqr+redc_1 = 0.352823
sqr+redc_2 = 0.352821
sqr+redc3 = 0.392026
mulredc1 = 0.029202
******************
Time in microseconds per call, size=15
mpn_mul_n = 0.204014
mpn_sqr = 0.138008
mpn_redc_1 = 0.270018
mpn_redc_2 = 0.264017
mulredc = 0.498030
mul+redc_1 = 0.474030
mul+redc_2 = 0.492032
mul+redc3 = 0.522032
sqr+redc_1 = 0.402026
sqr+redc_2 = 0.408026
sqr+redc3 = 0.456029
mulredc1 = 0.030002
******************
Time in microseconds per call, size=16
mpn_mul_n = 0.236814
mpn_sqr = 0.153610
mpn_redc_1 = 0.307219
mpn_redc_2 = 0.294419
mulredc = 0.556834
mul+redc_1 = 0.537634
mul+redc_2 = 0.524834
mul+redc3 = 0.582437
sqr+redc_1 = 0.454427
sqr+redc_2 = 0.448029
sqr+redc3 = 0.505632
mulredc1 = 0.031602
******************
Time in microseconds per call, size=17
mpn_mul_n = 0.278819
mpn_sqr = 0.163210
mpn_redc_1 = 0.333221
mpn_redc_2 = 0.326421
mulredc = 0.632439
mul+redc_1 = 0.605238
mul+redc_2 = 0.612039
mul+redc3 = 0.659641
sqr+redc_1 = 0.503233
sqr+redc_2 = 0.489631
sqr+redc3 = 0.564434
mulredc1 = 0.034002
******************
Time in microseconds per call, size=18
mpn_mul_n = 0.295218
mpn_sqr = 0.187211
mpn_redc_1 = 0.352824
mpn_redc_2 = 0.352822
mulredc = 0.705644
mul+redc_1 = 0.648042
mul+redc_2 = 0.640840
mul+redc3 = 0.734448
sqr+redc_1 = 0.540033
sqr+redc_2 = 0.540035
sqr+redc3 = 0.626440
mulredc1 = 0.035602
******************
Time in microseconds per call, size=19
mpn_mul_n = 0.319221
mpn_sqr = 0.197612
mpn_redc_1 = 0.395225
mpn_redc_2 = 0.418027
mulredc = 0.782851
mul+redc_1 = 0.714445
mul+redc_2 = 0.729647
mul+redc3 = 0.805653
sqr+redc_1 = 0.608039
sqr+redc_2 = 0.623239
sqr+redc3 = 0.691645
mulredc1 = 0.036802
******************
Time in microseconds per call, size=20
mpn_mul_n = 0.360022
mpn_sqr = 0.224014
mpn_redc_1 = 0.440028
mpn_redc_2 = 0.440028
mulredc = 0.864054
mul+redc_1 = 0.792048
mul+redc_2 = 0.792050
mul+redc3 = 0.880056
sqr+redc_1 = 0.664040
sqr+redc_2 = 0.656042
sqr+redc3 = 0.760048
mulredc1 = 0.038402

From nisse at lysator.liu.se Sun Feb 12 20:06:20 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Sun, 12 Feb 2012 20:06:20 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: (Zimmermann Paul's message of "Sun, 12 Feb 2012 18:44:25 +0100")
References: <861uq0c712.fsf@shell.gmplib.org>
Message-ID:

Zimmermann Paul writes:

> I am trying to optimize the modular multiplications and squarings in
> GMP-ECM (where we use Montgomery's reduction).

For these moderate sizes, does it pay off to precompute a full inverse
for the Montgomery reduction, rather than using redc_1 or redc_2? (I
don't quite understand which lines in your benchmark data I should
look at.)

Have you tried the "bidirectional" trick we discussed a while ago
(IIRC it was your idea): For size n, instead of the standard
Montgomery representation x' = B^n x (mod m), use the representation
x' = B^{n/2} x (mod m), and each time a size 2n product is to be
reduced, cancel n/2 limbs from the right, and n/2 from the left? For
the _1 version, it might even make sense to make a single loop working
from both ends.

And back to the original question: I guess you could try to completely
unroll the multiplication for some size of interest, and compare to
the general mpn_mul_basecase.

My understanding is that current branch predictors are fairly good at
the case of a loop which always runs for the same number of
iterations, so the loop overhead (even without unrolling) shouldn't be
severe. Totally unrolled code might have a greater potential for
speedups for squaring, mullo and mulhi than for mul_basecase, since
the former are probably less friendly to the branch prediction.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From Paul.Zimmermann at loria.fr Sun Feb 12 20:29:51 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Sun, 12 Feb 2012 20:29:51 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: (nisse@lysator.liu.se)
References: <861uq0c712.fsf@shell.gmplib.org>
Message-ID:

Niels,

> For these moderate sizes, does it pay off to precompute a full inverse
> for the Montgomery reduction, rather than using redc_1 or redc_2? (I
> don't quite understand which lines in your benchmark data I should
> look at.)

I believe those sizes are too small. With a full inverse, we need two
short products (one mullo and one mulhi) to be faster than redc_1 or
redc_2. This does not seem to be the case for 14 limbs, where
mpn_redc_n takes 0.45us, compared to 0.23us for mpn_redc_1:

Time in microseconds per call, size=14
mpn_mul_n = 0.179211
mpn_sqr = 0.123209
mpn_redc_1 = 0.229614
mpn_redc_2 = 0.235214
ecm_redc3 = 0.274418 # this is GMP-ECM variable-size assembly redc
mpn_redc_n = 0.448028
mulredc = 0.431227 # this is GMP-ECM assembly combined mul+redc
mul+redc_1 = 0.420027
mul+redc_2 = 0.414426
mul+redc3 = 0.464830
mul+redc_n = 0.627240
sqr+redc_1 = 0.358423
sqr+redc_2 = 0.352823
sqr+redc3 = 0.403226
sqr+redc_n = 0.554434

Legend: "mul" means mpn_mul_n, "sqr" means mpn_sqr

> Have you tried the "bidirectional" trick we discussed a while ago (IIRC
> it was your idea): For size n, instead of the standard Montgomery
> representation x' = B^n x (mod m), use the representation
> x' = B^{n/2} x (mod m), and each time a size 2n product is to be
> reduced, cancel n/2 limbs from the right, and n/2 from the left? For
> the _1 version, it might even make sense to make a single loop
> working from both ends.

no I didn't. But it doesn't save any computation, since to cancel n/2
limbs from each end, you need n/2 addmul_1 calls of size n per end,
thus in total n addmul_1 calls of size n.
The only benefit is when using 2 threads reducing from the left and
right.

> And back to the original question: I guess you could try to completely
> unroll the multiplication for some size of interest, and compare to the
> general mpn_mul_basecase.

I know nothing about assembly, this is why I asked on this list :-)

> My understanding is that current branch predictors are fairly good at
> the case of a loop which always runs for the same number of iterations,
> so the loop overhead (even without unrolling) shouldn't be severe.
> Totally unrolled code might have a greater potential for speedups for
> squaring, mullo and mulhi than for mul_basecase, since the former
> are probably less friendly to the branch prediction.

ok. Any speedup in squaring (mpn_sqr) is also most welcome, since our
best ECM Stage 1 code performs 4 modular multiplications and 4 modular
squarings per iteration.

Paul

From nisse at lysator.liu.se Mon Feb 13 13:12:03 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Mon, 13 Feb 2012 13:12:03 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: (Zimmermann Paul's message of "Sun, 12 Feb 2012 20:29:51 +0100")
References: <861uq0c712.fsf@shell.gmplib.org>
Message-ID:

Zimmermann Paul writes:

> I believe those sizes are too small. With a full inverse, we need two
> short products (one mullo and one mulhi) to be faster than redc_1 or
> redc_2. This does not seem to be the case for 14 limbs, where
> mpn_redc_n takes 0.45us, compared to 0.23us for mpn_redc_1:

I see.

>> And back to the original question: I guess you could try to completely
>> unroll the multiplication for some size of interest, and compare to the
>> general mpn_mul_basecase.

I think there's some potential for speed up of the linear term, which
is mostly relevant for small sizes. The addmul_1 calls can run at 3
cycles per limb or so.
But then computing the quotient involves dependent multiplications
with longer latency, so one may be able to compute the independent
left and right quotients in less time than computing two quotients at
the same end. Unclear to me if that's going to make any difference in
real code, in particular since the left-to-right quotient will require
some kind of adjustment step.

> ok. Any speedup in squaring (mpn_sqr) is also most welcome, since our best
> ECM Stage 1 code performs 4 modular multiplications and 4 modular
> squarings per iteration.

What sizes are important?

I have started to look a little into elliptic curve cryptography, and
there the sizes are pretty small. E.g., using the standard curve over
a 256-bit prime means that numbers are just four limbs on a 64-bit
machine. So in this case, I'd expect a specialized, completely
unrolled squaring function for this size could make a real difference.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se Mon Feb 13 13:20:35 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Mon, 13 Feb 2012 13:20:35 +0100
Subject: toom54
Message-ID:

I noticed that toom54 is missing. And it's easy, since all the
building blocks already are in place. Patch below. Comments?

I'd expect it to have a place with toom43 and toom44 below, and toom6h
above, but I have no good guess on how large that place might be.

Also toom72 is missing, which would use the same interpolation
function as toom54 and toom63. Not sure how useful that might be.

Regards,
/Niels

diff -r b856752462ac configure.in
--- a/configure.in Mon Feb 13 12:38:05 2012 +0100
+++ b/configure.in Mon Feb 13 12:40:42 2012 +0100
@@ -2634,7 +2634,7 @@
   hgcd2_jacobi hgcd_jacobi \
   mullo_n mullo_basecase \
   toom22_mul toom32_mul toom42_mul toom52_mul toom62_mul \
-  toom33_mul toom43_mul toom53_mul toom63_mul \
+  toom33_mul toom43_mul toom53_mul toom54_mul toom63_mul \
   toom44_mul \
   toom6h_mul toom6_sqr toom8h_mul toom8_sqr \
   toom_couple_handling \
diff -r b856752462ac gmp-impl.h
--- a/gmp-impl.h Mon Feb 13 12:38:05 2012 +0100
+++ b/gmp-impl.h Mon Feb 13 12:40:42 2012 +0100
@@ -4983,6 +4983,13 @@
   return 9 * n + 3;
 }
 
+static inline mp_size_t
+mpn_toom54_mul_itch (mp_size_t an, mp_size_t bn)
+{
+  mp_size_t n = 1 + (4 * an >= 5 * bn ? (an - 1) / (size_t) 5 : (bn - 1) / (size_t) 4);
+  return 9 * n + 3;
+}
+
 /* let S(n) = space required for input size n,
    then S(n) = 3 floor(n/2) + 1 + S(floor(n/2)). */
 #define mpn_toom42_mulmid_itch(n) \
diff -r b856752462ac mpn/generic/toom54_mul.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/mpn/generic/toom54_mul.c Mon Feb 13 12:40:42 2012 +0100
@@ -0,0 +1,135 @@
+/* Implementation of the toom54 (same points as toom63)
+
+   Contributed to the GNU project by Niels Möller.
+
+   THE FUNCTION IN THIS FILE IS INTERNAL WITH A MUTABLE INTERFACE. IT IS ONLY
+   SAFE TO REACH IT THROUGH DOCUMENTED INTERFACES. IN FACT, IT IS ALMOST
+   GUARANTEED THAT IT WILL CHANGE OR DISAPPEAR IN A FUTURE GNU MP RELEASE.
+
+Copyright 2009, 2012 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published by
+the Free Software Foundation; either version 3 of the License, or (at your
+option) any later version.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
+License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. */
+
+
+#include "gmp.h"
+#include "gmp-impl.h"
+
+
+/* Toom-4.5, the splitting 5x4 unbalanced version.
+   Evaluate in: infinity, +4, -4, +2, -2, +1, -1, 0.
+
+  <--s-><--n--><--n--><--n--><--n-->
+   ____ ______ ______ ______ ______
+  |_a4_|__a3__|__a2__|__a1__|__a0__|
+          |b3_|__b2__|__b1__|__b0__|
+          <-t-><--n--><--n--><--n-->
+
+*/
+#define TOOM_54_MUL_N_REC(p, a, b, n, ws) \
+  do { mpn_mul_n (p, a, b, n); \
+  } while (0)
+
+#define TOOM_54_MUL_REC(p, a, na, b, nb, ws) \
+  do { mpn_mul (p, a, na, b, nb); \
+  } while (0)
+
+void
+mpn_toom54_mul (mp_ptr pp,
+                mp_srcptr ap, mp_size_t an,
+                mp_srcptr bp, mp_size_t bn, mp_ptr scratch)
+{
+  mp_size_t n, s, t;
+  mp_limb_t cy;
+  int sign;
+
+  /***************************** decomposition *******************************/
+#define a4 (ap + 4 * n)
+#define b3 (bp + 3 * n)
+
+  ASSERT (an >= bn);
+  n = 1 + (4 * an >= 5 * bn ? (an - 1) / (size_t) 5 : (bn - 1) / (size_t) 4);
+
+  s = an - 4 * n;
+  t = bn - 3 * n;
+
+  ASSERT (0 < s && s <= n);
+  ASSERT (0 < t && t <= n);
+  /* WARNING! it assumes s+t>=n */
+  ASSERT ( s + t >= n );
+  ASSERT ( s + t > 4);
+  ASSERT ( n > 2);
+
+#define r8 pp /* 2n */
+#define r7 scratch /* 3n+1 */
+#define r5 (pp + 3*n) /* 3n+1 */
+#define v0 (pp + 3*n) /* n+1 */
+#define v1 (pp + 4*n+1) /* n+1 */
+#define v2 (pp + 5*n+2) /* n+1 */
+#define v3 (pp + 6*n+3) /* n+1 */
+#define r3 (scratch + 3 * n + 1) /* 3n+1 */
+#define r1 (pp + 7*n) /* s+t <= 2*n */
+#define ws (scratch + 6 * n + 2) /* ??? */
+
+  /* Alloc also 3n+1 limbs for ws... mpn_toom_interpolate_8pts may
+     need all of them, when DO_mpn_sublsh_n usea a scratch */
+  /********************** evaluation and recursive calls *********************/
+  /* $\pm4$ */
+  sign = mpn_toom_eval_pm2exp (v2, v0, 4, ap, n, s, 2, pp);
+  sign ^= mpn_toom_eval_pm2exp (v3, v1, 3, bp, n, t, 2, pp);
+  TOOM_54_MUL_N_REC(pp, v0, v1, n + 1, ws); /* A(-4)*B(-4) */
+  TOOM_54_MUL_N_REC(r3, v2, v3, n + 1, ws); /* A(+4)*B(+4) */
+  mpn_toom_couple_handling (r3, 2*n+1, pp, sign, n, 2, 4); /* FIXME: ...,2,4 ?*/
+
+  /* $\pm1$ */
+  sign = mpn_toom_eval_pm1 (v2, v0, 4, ap, n, s, pp);
+  sign ^= mpn_toom_eval_dgr3_pm1 (v3, v1, bp, n, t, pp);
+  TOOM_54_MUL_N_REC(pp, v0, v1, n + 1, ws); /* A(-1)*B(-1) */
+  TOOM_54_MUL_N_REC(r7, v2, v3, n + 1, ws); /* A(1)*B(1) */
+  mpn_toom_couple_handling (r7, 2*n+1, pp, sign, n, 0, 0);
+
+  /* $\pm2$ */
+  sign = mpn_toom_eval_pm2 (v2, v0, 4, ap, n, s, pp);
+  sign ^= mpn_toom_eval_dgr3_pm2 (v3, v1, bp, n, t, pp);
+  TOOM_54_MUL_N_REC(pp, v0, v1, n + 1, ws); /* A(-2)*B(-2) */
+  TOOM_54_MUL_N_REC(r5, v2, v3, n + 1, ws); /* A(+2)*B(+2) */
+  mpn_toom_couple_handling (r5, 2*n+1, pp, sign, n, 1, 2); /* FIXME: ...,1,2)? */
+
+  /* A(0)*B(0) */
+  TOOM_54_MUL_N_REC(pp, ap, bp, n, ws);
+
+  /* Infinity */
+  if (s > t) {
+    TOOM_54_MUL_REC(r1, a4, s, b3, t, ws);
+  } else {
+    TOOM_54_MUL_REC(r1, b3, t, a4, s, ws);
+  };
+
+  mpn_toom_interpolate_8pts (pp, n, r3, r7, s + t, ws);
+
+#undef a4
+#undef b3
+#undef r1
+#undef r3
+#undef r5
+#undef v0
+#undef v1
+#undef v2
+#undef v3
+#undef r7
+#undef r8
+#undef ws
+}
+
diff -r b856752462ac tests/mpn/Makefile.am
--- a/tests/mpn/Makefile.am Mon Feb 13 12:38:05 2012 +0100
+++ b/tests/mpn/Makefile.am Mon Feb 13 12:40:42 2012 +0100
@@ -25,7 +25,7 @@
 check_PROGRAMS = t-asmtype t-aors_1 t-divrem_1 t-mod_1 t-fat t-get_d \
   t-instrument t-iord_u t-mp_bases t-perfsqr t-scan \
   t-toom22 t-toom32 t-toom33 t-toom42 t-toom43 t-toom44 \
-  t-toom52 t-toom53 t-toom62 t-toom63 t-toom6h t-toom8h \
+  t-toom52 t-toom53 t-toom54 t-toom62 t-toom63 t-toom6h t-toom8h \
   t-mul t-mullo t-mulmod_bnm1 t-sqrmod_bnm1 t-mulmid \
   t-hgcd t-hgcd_appr t-matrix22 t-invert t-div t-bdiv
 
diff -r b856752462ac tests/mpn/t-toom54.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tests/mpn/t-toom54.c Mon Feb 13 12:40:42 2012 +0100
@@ -0,0 +1,8 @@
+#define mpn_toomMN_mul mpn_toom54_mul
+#define mpn_toomMN_mul_itch mpn_toom54_mul_itch
+
+#define MIN_AN 31
+#define MIN_BN(an) ((3*(an) + 32) / (size_t) 5) /* 3/5 */
+#define MAX_BN(an) ((an) - 6) /* 1/1 */
+
+#include "toom-shared.h"

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Mon Feb 13 13:25:41 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 13 Feb 2012 13:25:41 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: (Niels Möller's message of "Mon, 13 Feb 2012 13:12:03 +0100")
References: <861uq0c712.fsf@shell.gmplib.org>
Message-ID: <86pqdizzqy.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I have started to look a little into elliptic curve cryptography,
  and there the sizes are pretty small. E.g., using the standard curve
  over a 256-bit prime means that numbers are just four limbs on a
  64-bit machine. So in this case, I'd expect a specialized,
  completely unrolled squaring function for this size could make a
  real difference.

The x86_64 sqr_basecase has special code for n <= 4.

--
Torbjörn

From tg at gmplib.org Mon Feb 13 13:39:23 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 13 Feb 2012 13:39:23 +0100
Subject: toom54
In-Reply-To: (Niels Möller's message of "Mon, 13 Feb 2012 13:20:35 +0100")
References:
Message-ID: <86lio6zz44.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I noticed that toom54 is missing. And it's easy, since all the
  building blocks already are in place. Patch below. Comments?

  I'd expect it to have a place with toom43 and toom44 below, and
  toom6h above, but I have no good guess on how large that place might
  be.

  Also toom72 is missing, which would use the same interpolation
  function as toom54 and toom63. Not sure how useful that might be.

I am afraid Marco posted both a long time ago (2009?). They live in
shell:~tege/gmp/mpn/generic/toom{54,72}_mul.c. You might want to merge
your versions.

The diagrams at https://gmplib.org/devel/ include timing for Marco's
functions. It seems toom54 is quite useful, toom72 less so. (These
diagrams are from 2009, things will have changed.)

The tricky part might be making good use of them in mul.c. (I suppose
we never checked them in because we never fixed mul.c.)

--
Torbjörn

From Paul.Zimmermann at loria.fr Mon Feb 13 13:47:31 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Mon, 13 Feb 2012 13:47:31 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: (nisse@lysator.liu.se)
References: <861uq0c712.fsf@shell.gmplib.org>
Message-ID:

Niels,

> I think there's some potential for speed up of the linear term, which is
> mostly relevant for small sizes. The addmul_1 calls can run at 3 cycles
> per limb or so. But then computing the quotient involves dependent
> multiplications with longer latency, so one may be able to compute the
> independent left and right quotients in less time than computing two
> quotients at the same end. Unclear to me if that's going to make any
> difference in real code, in particular since the left-to-right quotient
> will require some kind of adjustment step.

agreed. But to avoid this dependent multiplication, I believe
Montgomery-Svoboda has more potential. Instead of performing n
addmul_1 calls with N, you perform n-1 calls with k*N such that the
low limb of k*N is -1 (thus the quotient selection is trivial) and one
call with N. Here is a reference C implementation (not tested), where
{sp, nn} = (k*N+1)/B:

static void
ecm_redc_1_svoboda (mp_ptr rp, mp_ptr tmp, mp_srcptr np, mp_size_t nn,
                    mp_limb_t invm, mp_srcptr sp)
{
  mp_size_t j;
  mp_limb_t t0, cy;

  /* instead of adding {np, nn} * (invm * tmp[0] mod B), we add
     {sp, nn} * tmp[0], where {np, nn} * invm = B * {sp, nn} - 1 */
  for (j = 0; j < nn - 1; j++, tmp++)
    rp[j + 1] = mpn_addmul_1 (tmp + 1, sp, nn, tmp[0]);

  /* for the last step, we reduce with {np, nn} */
  t0 = mpn_addmul_1 (tmp, np, nn, tmp[0] * invm);

  tmp ++;
  rp[0] = tmp[0];
  cy = mpn_add_n (rp + 1, rp + 1, tmp + 1, nn - 1);
  rp[nn-1] += t0;
  cy += rp[nn-1] < t0;
  if (cy != 0)
    mpn_sub_n (rp, rp, np, nn); /* a borrow should always occur here */
}

Of course the same idea could be applied to redc_2.

> What sizes are important?
>
> I have started to look a little into elliptic curve cryptography, and
> there the sizes are pretty small. E.g., using the standard curve over a
> 256-bit prime means that numbers are just four limbs on a 64-bit
> machine. So in this case, I'd expect a specialized, completely unrolled
> squaring function for this size could make a real difference.

for GMP-ECM, the most important range is from 10 to 20 limbs, i.e.,
about 200 to 400 decimal digits.

Paul

From nisse at lysator.liu.se Mon Feb 13 14:08:47 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Mon, 13 Feb 2012 14:08:47 +0100
Subject: toom54
In-Reply-To: <86lio6zz44.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Mon, 13 Feb 2012 13:39:23 +0100")
References: <86lio6zz44.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> I am afraid Marco posted both a long time ago (2009?). They live in
> shell:~tege/gmp/mpn/generic/toom{54,72}_mul.c.

Ah. That version is virtually identical (not surprising, given that
both versions are intimately related to the same toom63_mul.c). Just
some different names for the helper functions, and ASSERT (s + t > n)
vs ASSERT (s + t >= n), related to your recent change of
mpn_toom_interpolate_8pts.

> The diagrams at https://gmplib.org/devel/ include timing for Marco's
> functions. It seems toom54 is quite useful, toom72 less so. (These
> diagrams are from 2009, things will have changed.)

Should I push in toom54 then? (Naturally, Marco should have the
credit).

> The tricky part might be making good use of them in mul.c. (I suppose
> we never checked them in because we never fixed mul.c.)

toom52 and toom62 are also unused. Which reminds me that I should
correct the toom63 row in the diagram in mul.c.

/nisse

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
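
[Editorial illustration, not part of the thread.] The difference between ASSERT (s + t > n) and ASSERT (s + t >= n) mentioned above can be made concrete by working out the split sizes for given operand lengths. The sketch below copies the formula for n from the posted patch and derives s and t the same way; the names toom54_split_sizes, toom54_itch and toom54_split_ok are hypothetical helpers made up for this note and are not GMP functions. It uses plain long instead of mp_size_t so that it compiles without GMP.

```c
/* Hypothetical helpers (not part of GMP): reproduce the size
   bookkeeping of the posted mpn_toom54_mul patch with plain longs. */
struct toom54_split { long n, s, t; };

/* n is the common piece size; s and t are the sizes of the top
   pieces a4 and b3, exactly as computed in the patch. */
static struct toom54_split
toom54_split_sizes (long an, long bn)
{
  struct toom54_split r;
  r.n = 1 + (4 * an >= 5 * bn ? (an - 1) / 5 : (bn - 1) / 4);
  r.s = an - 4 * r.n;
  r.t = bn - 3 * r.n;
  return r;
}

/* Scratch space in limbs; same formula as mpn_toom54_mul_itch in the patch. */
static long
toom54_itch (long an, long bn)
{
  return 9 * toom54_split_sizes (an, bn).n + 3;
}

/* Returns 1 if the ASSERTs of the posted patch hold for (an, bn). */
static int
toom54_split_ok (long an, long bn)
{
  struct toom54_split r = toom54_split_sizes (an, bn);
  return an >= bn
    && 0 < r.s && r.s <= r.n
    && 0 < r.t && r.t <= r.n
    && r.s + r.t >= r.n      /* the patch's form; the other version uses > */
    && r.s + r.t > 4
    && r.n > 2;
}
```

For an = 31, bn = 25 (the smallest pair the posted t-toom54.c allows, since MIN_AN is 31 and MIN_BN(31) = (93 + 32) / 5 = 25) this gives n = 7, s = 3, t = 4, so s + t equals n exactly: the >= form of the assertion holds and the strict form would not. For an = 50, bn = 40 it gives n = s = t = 10 and a scratch requirement of 93 limbs.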
From nisse at lysator.liu.se Mon Feb 13 15:07:45 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Mon, 13 Feb 2012 15:07:45 +0100 Subject: fixed-size mpn_mul_n for small n? In-Reply-To: (Zimmermann Paul's message of "Mon, 13 Feb 2012 13:47:31 +0100") References: <861uq0c712.fsf@shell.gmplib.org> Message-ID: Zimmermann Paul writes: > agreed. But to avoid this dependent multiplication, I believe > Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls > with N, you perform n-1 call with k*N such that the low limb of k*N is -1 > (thus the quotient selection is trivial) and one call with N. Ah, that's clever. I wasn't aware of that method. > for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200 > to 400 decimal digits. Using completely unrolled code for all these sizes (I'm now primarily thinking of squaring) seems a bit impractical. Maybe one could do something reasonable with specialcase code for collecting the off-diagonal terms (long code with jumpts into), and then a plain loop for the diagonal and the final shift + add. /nisse -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From bodrato at mail.dm.unipi.it Mon Feb 13 17:44:10 2012 From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it) Date: Mon, 13 Feb 2012 17:44:10 +0100 (CET) Subject: toom54 In-Reply-To: References: <86lio6zz44.fsf@shell.gmplib.org> Message-ID: <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> Ciao, Il Lun, 13 Febbraio 2012 2:08 pm, Niels M?ller ha scritto: > Torbjorn Granlund writes: >> shell:~tege/gmp/mpn/generic/toom{54,72}_mul.c. > > Ah. That version is virtually identical (not surprising, given that both > versions are intimately related to the same toom63_mul.c). Just some Yes, Toom-4.5 inversion is structured with the Toom'n'half strategy. 
It shouldn't be difficult to write a single function working both as 54 and 63 :-) For those people with no access to shell, my (old) code is available on my site: http://bodrato.it/software/toom.html#TC4.5 . >> The diagrams at https://gmplib.org/devel/ include timing for Marco's >> functions. It seems toom54 is quite useful, toom72 less so. (These >> diagrams are from 2009, things will have changed.) Yes, things have changed, the main such change comes from the new toom6h and toom8h functions. It would be nice to regenerate the diagrams with the new algorithms. I guess toom72 will not cover a wide region in a current version. > toom52 and toom62 are also unused. Which reminds me that I should > correct the toom63 row in the diagram in mul.c. And that diagram reminds me that the unbalancement capability of toom6h and toom8h are still unused... Regards, m -- http://bodrato.it/toom-cook/ From Paul.Zimmermann at loria.fr Mon Feb 13 19:00:43 2012 From: Paul.Zimmermann at loria.fr (Zimmermann Paul) Date: Mon, 13 Feb 2012 19:00:43 +0100 Subject: fixed-size mpn_mul_n for small n? In-Reply-To: (nisse@lysator.liu.se) References: <861uq0c712.fsf@shell.gmplib.org> Message-ID: Niels, > > agreed. But to avoid this dependent multiplication, I believe > > Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls > > with N, you perform n-1 call with k*N such that the low limb of k*N is -1 > > (thus the quotient selection is trivial) and one call with N. > > Ah, that's clever. I wasn't aware of that method. the subquadratic version is described in [1], Algorithm 2.8. However I only recently figured out that with (kN+1)/B precomputed, you only need an addmul_1 of length n (instead of n+1) at each step for the quadratic version. > > for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200 > > to 400 decimal digits. for RSA signature and encryption, 16 and 32 limbs are important targets too. 
> Using completely unrolled code for all these sizes (I'm now primarily > thinking of squaring) seems a bit impractical. Maybe one could do > something reasonable with specialcase code for collecting the > off-diagonal terms (long code with jumpts into), and then a plain loop > for the diagonal and the final shift + add. I found a variant, but I'm not sure it is better: 1) first put the diagonal terms in place (this will fill exactly the 2n buffer) 2) divide by 2 (if the input is odd, store the carry out) 3) accumulate the off-diagonal terms (could be done in assembly as you suggest) 4) multiply by 2 (and restore the carry out) 5) perform the usual reduction Paul [1] Modern Computer Arithmetic, Richard Brent and Paul Zimmermann, Cambridge University Press, 2010, available online. From nisse at lysator.liu.se Mon Feb 13 21:38:26 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Mon, 13 Feb 2012 21:38:26 +0100 Subject: toom54 In-Reply-To: <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> (bodrato@mail.dm.unipi.it's message of "Mon, 13 Feb 2012 17:44:10 +0100 (CET)") References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> Message-ID: bodrato at mail.dm.unipi.it writes: > Yes, Toom-4.5 inversion is structured with the Toom'n'half strategy. It > shouldn't be difficult to write a single function working both as 54 and > 63 :-) That's a question of taste. I'd prefer separate functions, and then leave choice of stratey to mpn_mul. But which style really makes the most sense depends a bit on where relevant thresholds end up. toom54 is really a small and simply function now, the interesting piece is just 35 lines. The reason toom63 is a bit larger is that evaluation of the degree 2 polynomial (3 coefficients) is inlined. One could save some code size by writing some helper functions for this case as well (shared by toom[3456]3 and possibly also toom32). 
Or if the function call overhead is too expensive for toom33 and toom32, mpn_toom_eval_dgr2_pm1 could be done as a macro rather than a function, even if that would just reduce source code size, not object code size. > Yes, things have changed, the main such change comes from the new toom6h > and toom8h functions. It would be nice to regenerate the diagrams with the > new algorithms. I guess toom72 will not cover a wide region in a current > version. I think it might provide some insight to do benchmarks along fixed ratios. For each algorithm, there's a supported range. Take both end points and the midpoint. Along each of these three lines, benchmark for a range of sizes with a fixed ration, and see where the algorithm beats other algorithms which support the same ratio. There's some complication from the unbalanced calls (which one would want to use an optimal algorithm choice), but I hope that will have a fairly small influence on the results. It would also be interesting to prepare some graph showing for each function which functions it may use for the recursive calls (with a close-to-optimal algorithm choice). E.g., I suspect toom32 shouldn't call anything but basecase, toom32 and toom22. Regards, /Niels -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From tg at gmplib.org Tue Feb 14 20:09:24 2012 From: tg at gmplib.org (Torbjorn Granlund) Date: Tue, 14 Feb 2012 20:09:24 +0100 Subject: toom54 In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Mon\, 13 Feb 2012 21\:38\:26 +0100") References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> Message-ID: <86vcn9w7tn.fsf@shell.gmplib.org> nisse at lysator.liu.se (Niels M?ller) writes: I think it might provide some insight to do benchmarks along fixed ratios. For each algorithm, there's a supported range. Take both end points and the midpoint. 
  Along each of these three lines, benchmark for a range of sizes
  with a fixed ratio, and see where the algorithm beats other algorithms
  which support the same ratio. There's some complication from the
  unbalanced calls (which one would want to use an optimal algorithm
  choice), but I hope that will have a fairly small influence on the
  results.

  It would also be interesting to prepare some graph showing for each
  function which functions it may use for the recursive calls (with a
  close-to-optimal algorithm choice). E.g., I suspect toom32 shouldn't
  call anything but basecase, toom32 and toom22.

Not so sure. But we can decide to make it that. It currently is
fastest for large operands, in a *narrow* space between 43 and 53, and
its subproducts will want 33.

To enable 54 in today's framework, it should probably be here:

        else if (2 * un < 3 * vn)
          {
            if (BELOW_THRESHOLD (vn, MUL_TOOM32_TO_TOOM43_THRESHOLD))
              mpn_toom32_mul (prodp, up, un, vp, vn, scratch);
!           else if (BELOW_THRESHOLD (vn, MUL_TOOM43_TO_TOOM54_THRESHOLD))
              mpn_toom43_mul (prodp, up, un, vp, vn, scratch);
!           else
!             mpn_toom54_mul (prodp, up, un, vp, vn, scratch);
          }

But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set
to some default.

-- 
Torbjörn

From nisse at lysator.liu.se Wed Feb 15 10:13:06 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Wed, 15 Feb 2012 10:13:06 +0100
Subject: toom54
In-Reply-To: <86vcn9w7tn.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 14 Feb 2012 20:09:24 +0100")
References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> <86vcn9w7tn.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> Not so sure. But we can decide to make it that. It currently is
> fastest for large operands, in a *narrow* space between 43 and 53, and
> its subproducts will want 33.

If so, we have an apparent cycle in the graph,

          toom33
          ^   \
         /     \
        V       V
   toom32 <---> toom22

For the range where toom32 is (or should be) used, its calls to toom33
shouldn't generate any calls to higher tooms(?). And when toom22 calls
toom32, it also shouldn't call toom33. Hence, the above subgraph is for
calls to toom32, while for a call to toom22, we only have

   toom32 <---> toom22

And moving up, in the general case, I imagine toom33 might at least call
toom43? It's going to be a fairly complicated graph.

I'm also thinking about how to do the itch functions for these
functions, taking all recursive calls into account. I still think we
want closed formulas for the lowest tooms, while that's most likely
*not* practical for the higher ones.

> But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set
> to some default.

I've started to look at tuning. I don't yet understand how the
TOOMX_TO_TOOMY tuning works. I started by adding
MPN_TOOM54_MUL_MINSIZE, and then I noticed that the corresponding
values for toom43 and toom53 are commented with "???" in gmp-impl.h.

Is it correct to use the same value as MIN_AN in the corresponding
test files? If so, I think this is the right patch:

diff -r 605ce4a6238b gmp-impl.h
--- a/gmp-impl.h	Tue Feb 14 21:41:21 2012 +0100
+++ b/gmp-impl.h	Wed Feb 15 10:04:32 2012 +0100
@@ -1247,8 +1247,9 @@ typedef struct {
 
 #define MPN_TOOM32_MUL_MINSIZE   10
 #define MPN_TOOM42_MUL_MINSIZE   10
-#define MPN_TOOM43_MUL_MINSIZE   49 /* ??? */
-#define MPN_TOOM53_MUL_MINSIZE   49 /* ??? */
+#define MPN_TOOM43_MUL_MINSIZE   25
+#define MPN_TOOM53_MUL_MINSIZE   17
+#define MPN_TOOM54_MUL_MINSIZE   31
 #define MPN_TOOM63_MUL_MINSIZE   49
 
 #define MPN_TOOM42_MULMID_MINSIZE    4

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
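
As a side note that is not part of the thread, the proposed three-way choice for operands with un/vn below 3/2 can be modelled in a few lines of standalone C. This is only a sketch of the selection logic quoted above: the toy_* names and both threshold values are made-up placeholders, not measured or default GMP values, and BELOW_THRESHOLD is approximated by a plain comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder thresholds, assumed for illustration only.  Real values
   come from tuning and differ from machine to machine. */
enum
{
  TOY_TOOM32_TO_TOOM43_THRESHOLD = 100,
  TOY_TOOM43_TO_TOOM54_THRESHOLD = 150
};

typedef enum { TOY_TOOM32, TOY_TOOM43, TOY_TOOM54, TOY_OTHER } toy_alg;

/* Pick an algorithm for a un x vn limb product with un >= vn > 0.
   Only the branch 2*un < 3*vn is modelled; every other operand shape
   is reported as TOY_OTHER. */
static toy_alg
toy_choose (size_t un, size_t vn)
{
  assert (un >= vn && vn > 0);
  if (2 * un < 3 * vn)
    {
      if (vn < TOY_TOOM32_TO_TOOM43_THRESHOLD)
        return TOY_TOOM32;
      else if (vn < TOY_TOOM43_TO_TOOM54_THRESHOLD)
        return TOY_TOOM43;
      else
        return TOY_TOOM54;
    }
  return TOY_OTHER;
}
```

With these placeholder values, toy_choose (60, 50) selects toom32, toy_choose (120, 100) selects toom43, and toy_choose (200, 180) selects toom54. Measuring where the second boundary should actually sit is the open tuning question in the message above.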
From nisse at lysator.liu.se Wed Feb 15 15:08:54 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Wed, 15 Feb 2012 15:08:54 +0100 Subject: toom54 In-Reply-To: <86vcn9w7tn.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 14 Feb 2012 20:09:24 +0100") References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> <86vcn9w7tn.fsf@shell.gmplib.org> Message-ID: Torbjorn Granlund writes: > But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set to > some default. I have pushed in an attempt at tuning code. Please have a look to see if makes sense. I measure a line with an operand size ratio 5/6. On this core2 machine I get #define MUL_TOOM43_TO_TOOM54_THRESHOLD 100 I set the default to 150, since other tuned thresholds seem to be a bit smaller than the defaults. /nisse -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From tg at gmplib.org Thu Feb 16 07:54:02 2012 From: tg at gmplib.org (Torbjorn Granlund) Date: Thu, 16 Feb 2012 07:54:02 +0100 Subject: toom54 In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 15 Feb 2012 15\:08\:54 +0100") References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> <86vcn9w7tn.fsf@shell.gmplib.org> Message-ID: <86zkcjuv3p.fsf@shell.gmplib.org> nisse at lysator.liu.se (Niels M?ller) writes: > But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set to > some default. I have pushed in an attempt at tuning code. Please have a look to see if makes sense. I measure a line with an operand size ratio 5/6. On this core2 machine I get #define MUL_TOOM43_TO_TOOM54_THRESHOLD 100 I set the default to 150, since other tuned thresholds seem to be a bit smaller than the defaults. 
I haven't yet admired the code, but magically it is now tabled with the other thtresholds: http://gmplib.org/devel/thresholds.html -- Torbj?rn From nisse at lysator.liu.se Thu Feb 16 09:42:23 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Thu, 16 Feb 2012 09:42:23 +0100 Subject: toom54 In-Reply-To: <86zkcjuv3p.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Thu, 16 Feb 2012 07:54:02 +0100") References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> <86vcn9w7tn.fsf@shell.gmplib.org> <86zkcjuv3p.fsf@shell.gmplib.org> Message-ID: Torbjorn Granlund writes: > I haven't yet admired the code, but magically it is now tabled with the > other thtresholds: > > http://gmplib.org/devel/thresholds.html Nice. Median 124, compared to 85 for TOOM32_TO_TOOM43, and 96 for TOOM42_TO_TOOM63. /Niels -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From nisse at lysator.liu.se Thu Feb 16 12:03:04 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Thu, 16 Feb 2012 12:03:04 +0100 Subject: Status update: mini-gmp In-Reply-To: <86k44qnrzk.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 17 Jan 2012 18:44:47 +0100") References: <86k44qnrzk.fsf@shell.gmplib.org> Message-ID: Torbjorn Granlund writes: > It would be nicer to have it in a separate directory, both for GMP > directory cleanness and for easy extraction by users. Right, I guess that makes the most sense. After copying the mini-gmp directory, I can build gmp with the attached patch. For the functions in dumbmp.c: * Most aren't needed any more. * Functions used in only one file are copied to that file. * The remaining few functions are moved to a new file, named mini-gmp-extra.c, at the gmp top-level. Functions which get by with just mini-gmp include mini-gmp/mini-gmp.c, the rest instead include mini-gmp-extra.c. 
I did some other minor changes to the files: Use assert rather than ASSERT. Use memmove rather than mem_copyi. Deleted casts of the return value from xmalloc, instead expecting the return value to be of type void *. In mini-gmp-extra.c, I defined xmalloc as an alias for gmp_default_xalloc, an internal function in mini-gmp.c, rather than including yet another copy of the same thing. I'm not sure how to best do the automakery to get mini-gmp included in the gmp distribution. Is it good enough to just add the mini-gmp directory to EXTRA_DIST? Or do we need a list or glob pattern for the wanted files somewhere? I'd like to avoid that backup files, build products etc from the mini-gmp directory are picked up by accident, and I'm not sure how smart the automake rules are for directories listed in EXTRA_DIST. > From GMP's perspective, I don't think mini-gmp unit testing should be > necessary. I see. But it would be nice with a top-level make targat check-mini-gmp, to build and test mini-gmp with the same configuration (compiler, builddir, etc) as set up for gmp. Regards, /Niels -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: mini-gmp.patch.2 URL: -------------- next part -------------- -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From tg at gmplib.org Thu Feb 16 12:11:19 2012 From: tg at gmplib.org (Torbjorn Granlund) Date: Thu, 16 Feb 2012 12:11:19 +0100 Subject: Status update: mini-gmp In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Thu\, 16 Feb 2012 12\:03\:04 +0100") References: <86k44qnrzk.fsf@shell.gmplib.org> Message-ID: <86ipj7hw2w.fsf@shell.gmplib.org> nisse at lysator.liu.se (Niels M?ller) writes: After copying the mini-gmp directory, I can build gmp with the attached patch. For the functions in dumbmp.c: Nice! * Most aren't needed any more. * Functions used in only one file are copied to that file. 
* The remaining few functions are moved to a new file, named mini-gmp-extra.c, at the gmp top-level. Perhaps a better name would be bootstrap.c? Functions which get by with just mini-gmp include mini-gmp/mini-gmp.c, the rest instead include mini-gmp-extra.c. I suppose that's not a truly necessary tweak. Using the strategy of including bootstrap.c/mini-gmp-extra.c in all files might be a bit cleaner. I did some other minor changes to the files: Use assert rather than ASSERT. Use memmove rather than mem_copyi. Deleted casts of the return value from xmalloc, instead expecting the return value to be of type void *. In mini-gmp-extra.c, I defined xmalloc as an alias for gmp_default_xalloc, an internal function in mini-gmp.c, rather than including yet another copy of the same thing. OK. I'm not sure how to best do the automakery to get mini-gmp included in the gmp distribution. Is it good enough to just add the mini-gmp directory to EXTRA_DIST? Or do we need a list or glob pattern for the wanted files somewhere? I don't recall, I do this to seldomly, please make some experiments. > From GMP's perspective, I don't think mini-gmp unit testing should be > necessary. I see. But it would be nice with a top-level make targat check-mini-gmp, to build and test mini-gmp with the same configuration (compiler, builddir, etc) as set up for gmp. Makes sense. -- Torbj?rn From nisse at lysator.liu.se Thu Feb 16 12:23:33 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Thu, 16 Feb 2012 12:23:33 +0100 Subject: Status update: mini-gmp In-Reply-To: <86ipj7hw2w.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Thu, 16 Feb 2012 12:11:19 +0100") References: <86k44qnrzk.fsf@shell.gmplib.org> <86ipj7hw2w.fsf@shell.gmplib.org> Message-ID: Torbjorn Granlund writes: > Perhaps a better name would be bootstrap.c? Makes some sense. At least it's shorter. > I suppose that's not a truly necessary tweak. 
Using the strategy of > including bootstrap.c/mini-gmp-extra.c in all files might be a bit > cleaner. Depends on whether or not we aim to eliminate that extra file and only use mini-gmp.c. If not, then I agree it's a bit cleaner to include the bootstrap.c file everywhere, and then we can also be less zealous about moving definitions out from that file. And one correction, I wrote: > In mini-gmp-extra.c, I defined xmalloc as an alias for > gmp_default_xalloc, an internal function in mini-gmp.c, rather than > including yet another copy of the same thing. I think that's a sensible thing to do, but that change wasn't actually included in the posted patch. Regards, /Niels -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From tg at gmplib.org Thu Feb 16 18:06:53 2012 From: tg at gmplib.org (Torbjorn Granlund) Date: Thu, 16 Feb 2012 18:06:53 +0100 Subject: toom54 In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 15 Feb 2012 15\:08\:54 +0100") References: <86lio6zz44.fsf@shell.gmplib.org> <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it> <86vcn9w7tn.fsf@shell.gmplib.org> Message-ID: <86ty2q672q.fsf@shell.gmplib.org> nisse at lysator.liu.se (Niels M?ller) writes: I have pushed in an attempt at tuning code. Please have a look to see if makes sense. I measure a line with an operand size ratio 5/6. On this core2 machine I get It looks similar to my old code, which I no longer understand, so it must be right. :-) -- Torbj?rn From marc.glisse at inria.fr Thu Feb 16 22:24:11 2012 From: marc.glisse at inria.fr (Marc Glisse) Date: Thu, 16 Feb 2012 22:24:11 +0100 (CET) Subject: g++-3.4 bug Message-ID: Hello, some tests currently fail on gmpxx with g++-3.4 (on shell). 
This is due to a bug in g++-3.4, which for l=2 says the following is true: __builtin_constant_p(l) && (l == 0) it is interesting to insert a printf statement that prints both l and l==0 and have it print 2 and true :-/ More recent versions like g++42 seem fine. There are various simple workarounds that make g++34 happy (although in principle they shouldn't change anything), but that seems a bit dangerous (who knows where the bug might be lurking?). My current plan is to replace the test for the existence of __builtin_constant_p from __GNUC__ >= 2 to __GMP_GNUC_PREREQ(4, 2) (or possibly 4.0 or 4.1 if I can get access to either to check if the bug is present). This would only affect the use of __builtin_constant_p in gmpxx.h (not anywhere else, in particular not gmp.h), which is new in gmp-5.1. Does that sound ok? -- Marc Glisse From nisse at lysator.liu.se Fri Feb 17 11:29:09 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Fri, 17 Feb 2012 11:29:09 +0100 Subject: mini-gmp checkin? Message-ID: I've had a look at the dist machinery. Adding a directory to EXTRA_DIST copies the directory and *all* files. Then GMP uses dist-hook to clean up a bit. I ended up adding a couple of the files to EXTRA_DIST, and then a line in dist-hook to also copy mini-gmp/tests/*.[ch]. Seems to work fine, make dist appears to pick up the right files, and make distcheck works. Complete patch is rather large, so I put it at http://www.lysator.liu.se/~nisse/misc/mini-gmp.patch3 I settled for the name bootstrap.c. Otherwise the changes are about the same as in the previous patch. I'm about to check this in. Ok? regards, /Niels -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From bodrato at mail.dm.unipi.it Sat Feb 18 17:55:57 2012 From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it) Date: Sat, 18 Feb 2012 17:55:57 +0100 (CET) Subject: mini-gmp checkin? 
In-Reply-To: References: Message-ID: <49248.151.32.167.136.1329584157.squirrel@mail.dm.unipi.it> Ciao, Il Ven, 17 Febbraio 2012 11:29 am, Niels M?ller ha scritto: > I've had a look at the dist machinery. Adding a directory to EXTRA_DIST > copies the directory and *all* files. Then GMP uses dist-hook to clean > up a bit. I ended up adding a couple of the files to EXTRA_DIST, and > then a line in dist-hook to also copy mini-gmp/tests/*.[ch]. Seems to Sounds a bit tricky to me, but it probably is the cleanest way if we want to avoid recursive Makefiles in mini-gmp/ > I'm about to check this in. Ok? Some changes may be needed to mini-gmp, but they can be delayed after the first check-in. E.g. mpn_invert_3by2 and mpn_invert_limb are defined in mini-gmp.h, but they probably should not. Regards, m -- http://bodrato.it/ From nisse at lysator.liu.se Mon Feb 20 10:19:03 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Mon, 20 Feb 2012 10:19:03 +0100 Subject: About mpn_sqr_basecase Message-ID: I've been looking a bit on mpn_sqr_basecase. 1. It uses a temporary array, apparently for systems lacking an mpn_sqr_diag_addlsh1 which can do in-place operation. I think one can support arbitrary sizes with a small temporary area if one collects the off-diagonal terms into the result area, and then computes the diagonal products blockwise, a hundred limbs at a time or so. Current code uses the opposite allocation, with diagonal terms in the result area and the off-diagonal terms in the temporary array. 2. One could do the shifting differently, applying the shift to the limb argument of addmul_1. Something like, when doing the off-diagonal products for up[i], mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1, (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1))); Might be cheaper, if we can get this shifting done in parallel with other operations, and get a simpler carry propagation recurrency when adding diagonal and off-diagonal terms together. 3. 
The comments on using addmul_2 say that it is tricky. I think
that's because the diagonal terms are still 1x1 products. That would
get simpler if one treats double-limbs as the units everywhere,
having the diagonal computation form the 2x2 products (u0, u1)^2,
(u2, u3)^2, ... The total number of limb products ought to be the
same, if we do each of these terms as

     (u0 + u1 B)^2 = u0^2 + B^2 u1^2 + 2B (u0 * u1)
                   = u0^2 + B^2 u1^2 + B (u0 << 1) * u1
                     + B^2 u1 & HIGH_BIT_TO_MASK(u0)

   The off-diagonal terms, to be computed with addmul_2, are then

     (u0, u1) * (u2, u3, ...)
     (u2, u3) * (u4, u5, ...)
     ...

   I guess one can also collect the close-to-diagonal terms
   B u0 u1 + B^5 u2 u3 + ..., together with the other off-diagonal
   terms. One would then get the off-diagonal sum

     B u0*u1 + B^2 (u0, u1) * (u2, u3, ...)
     B^5 u2*u3 + B^6 (u2, u3) * (u4, u5, ...)

   which begs for an mpn_addmul_1c accepting an additional carry input
   limb. Is there such a function?

4. There's code to use mpn_addmul_2s. What is that function supposed to
   do, is it doing the above sum?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Mon Feb 20 10:40:38 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 10:40:38 +0100
Subject: About mpn_sqr_basecase
In-Reply-To: (Niels Möller's message of "Mon, 20 Feb 2012 10:19:03 +0100")
References:
Message-ID: <86linxc06h.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  1. It uses a temporary array, apparently for systems lacking an
     mpn_sqr_diag_addlsh1 which can do in-place operation. I think one can
     support arbitrary sizes with a small temporary area if one collects
     the off-diagonal terms into the result area, and then computes the
     diagonal products blockwise, a hundred limbs at a time or so.

Why would one want to support such large sizes?
Current code uses the opposite allocation, with diagonal terms in the result area and the off-diagonal terms in the temporary array. Old x86 code (p6, k6, k7) might do it like that. The 9 other assembly files for sqr_basecase handle arbitrary operands. 2. One could do the shifting differently, applying the shift to the limb argument of addmul_1. Something like, when doing the off-diagonal products for up[i], mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1, (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1))); Might be cheaper, if we can get this shifting done in parallel with other operations, and get a simpler carry propagation recurrency when adding diagonal and off-diagonal terms together. And then handle carry-out from the last shift a a conditional add_n. 3. The comments on using addmul_2 says that is is tricky. I think that's because the diagonal terms are still 1x1 products. That would get simpler of one treats double-limbs as the units everywhere, having the diagonal computation form the 2x2 products (u0, u1)^2, (u2, u3)^2, ... The total number of limb products ought to be the same, if we do each of these terms as (u0 + u1 B)^2 = u0^2 + B^2 u1^2 + 2B (u0 * u1) = u0^2 + B^2 u1^2 + B (u0 << 1) * u1 + B^2 u1 & HIGH_BIT_TO_MASK(u0) The off-diagonal terms, to be computed with addmul_2, are then (u0, u1) * (u2, u3, ...) (u2, u3) * (u4, u5, ...) ... I guess one can also collect the close-to-diagonal terms B u0 u1 + B^5 u2 u3 + ..., together with the other off-diagonal terms, One would then get the off-diagonal sum B u0*u1 + B^2 (u0, u1) * (u2, u3, ...) B^5 u2*u3 + B^6 (u2, u3) * (u4, u5, ...) which begs for an mpn_addmul_1c accepting an additional carry input limb. Is there such a function? I beg to differ about the greatness of this approach. It might make some parts of the code look simpler, but will it be faster? The C sqr_basecase code is a bit hairy, but mainly because of the many variants, but the many variants are there for best performance everywhere. 4. 
There's code to use mpn_addmul_2s. What is that function supposed to do, is it doing the above sum? No. It is a addmul_2 which suppresses a final mul to avoid including the diagonal product. I don't think any processor explicitly provides it (although it is mentioned in several assembly files). I have explicit such sparc64 code in a repo someplace. -- Torbj?rn From rguenther at suse.de Mon Feb 20 10:46:01 2012 From: rguenther at suse.de (Richard Guenther) Date: Mon, 20 Feb 2012 10:46:01 +0100 (CET) Subject: g++-3.4 bug In-Reply-To: References: Message-ID: On Thu, 16 Feb 2012, Marc Glisse wrote: > Hello, > > some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to a > bug in g++-3.4, which for l=2 says the following is true: > __builtin_constant_p(l) && (l == 0) > it is interesting to insert a printf statement that prints both l and l==0 and > have it print 2 and true :-/ Not for me. int main () { int l = 2; if (__builtin_constant_p (l) && (l == 0)) return 1; return 0; } Richard. From nisse at lysator.liu.se Mon Feb 20 11:59:55 2012 From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=) Date: Mon, 20 Feb 2012 11:59:55 +0100 Subject: About mpn_sqr_basecase In-Reply-To: <86linxc06h.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Mon, 20 Feb 2012 10:40:38 +0100") References: <86linxc06h.fsf@shell.gmplib.org> Message-ID: Torbjorn Granlund writes: > Why would one want to support such large sizes? It would be nice to get rid of the SQR_TOOM2_THRESHOLD size restriction in *all* squaring code. Since that restriction makes things a bit brittle, and causes additional complexities for the fat case. I see good good reason besides that. Switching the use of the allocation areas seems like a simple way to get rid of that. I've been looking mostly at the C sqr_basecase. > 2. One could do the shifting differently, applying the shift to the limb > argument of addmul_1. 
Something like, when doing the off-diagonal > products for up[i], > > mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1, > (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1))); > > Might be cheaper, if we can get this shifting done in parallel with > other operations, and get a simpler carry propagation recurrency when > adding diagonal and off-diagonal terms together. > > And then handle carry-out from the last shift a a conditional add_n. It's not that bad. In this scheme, up[i] (shifted) is multiplied by the n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we shift up[n-2], we only need a conditional add of the single limb up[n-1]. [On use of addmul_2]: > I beg to differ about the greatness of this approach. Care to elaborate on why you expect it to be to slow? I imagine carry handling for the close-to-diagonal terms up[2k] * up[2k+1] will be slow without assembler support. Or should we postpone this discussion until there's some code to compare? BTW, what do you think about the mpn_addmul_1c entrypoint? Would it make sense with addmul_2c as well? addmul_1c is declared in gmp-impl.h, and it seems it's implemented on some x86_32 configurations and on powerpc64. Regards, /Niels -- Niels M?ller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. From tg at gmplib.org Mon Feb 20 12:12:22 2012 From: tg at gmplib.org (Torbjorn Granlund) Date: Mon, 20 Feb 2012 12:12:22 +0100 Subject: Sandybridge addmul_N challenge Message-ID: <868vjxbvxl.fsf@shell.gmplib.org> The two high-end architectures for GMP are AMD K8-K10 (i.e., all Opteron except 62xx, Athlon 64, Athlon X2, Phenom, Phenom II) and Intel Sandybridge (i.e., socket 1155 and 2011 Core i3,i5,i7). We have great multiplication loops for K8-K10, addmul_1 runs at 2.5 c/l and addmul_2 runs at 2.375 c/l. 
(These loops are then used in mul_basecase, sqr_basecase, redc_1,
redc_2, and a few other places.)

But our multiplication loops for Sandybridge are much worse. The
current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l. I
have new code running at 3.4 c/l and 3.3 c/l respectively.

The critical instructions for these loops are MUL and ADC. The
throughput of MUL is great for both CPUs (actually better on Sandybridge
than K8-K10). ADC is more tricky; AMD can issue 3 per cycle with a
latency of 1 cycle, but for Intel the situation is trickier: In all
cases the carry-out has a latency of 1 cycle. For "ADC $0,dreg" the
latency of dreg is one cycle, but for "ADC sreg,dreg" it is two cycles.
(It is also 2 cycles for "ADC $const,dreg" when const != 0.)

The challenge is to beat 3 c/l with either addmul_1 or addmul_2.
Success will boost GMP's general performance on these processors, since
every higher-level operation depends on these lowest-level multiply
primitives.

-- 
Torbjörn

From foxmuldrster at yahoo.com Mon Feb 20 12:19:32 2012
From: foxmuldrster at yahoo.com (Rick Hodgin)
Date: Mon, 20 Feb 2012 03:19:32 -0800 (PST)
Subject: Sandybridge addmul_N challenge
Message-ID: <1329736772.87971.androidMobile@web125402.mail.ne1.yahoo.com>

What source file and line?

Best regards,
Rick C. Hodgin

From marc.glisse at inria.fr Mon Feb 20 20:39:43 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Mon, 20 Feb 2012 20:39:43 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To:
References:
Message-ID:

On Mon, 20 Feb 2012, Richard Guenther wrote:

> On Thu, 16 Feb 2012, Marc Glisse wrote:
>
>> Hello,
>>
>> some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to a
>> bug in g++-3.4, which for l=2 says the following is true:
>> __builtin_constant_p(l) && (l == 0)
>> it is interesting to insert a printf statement that prints both l and l==0 and
>> have it print 2 and true :-/
>
> Not for me.
>
> int main () { int l = 2; if (__builtin_constant_p (l) && (l == 0)) return
> 1; return 0; }

Did you try getting a snapshot of the gmp repos and running the c++
testsuite? Yes, your simple example passes, but you know better than I
do how much the context matters for optimizations. And since the bug
doesn't seem reproducible with more recent versions of gcc, there is
little motivation to reduce the failing tests.

The failing compiler was 3.4.6, from ports on freebsd 8.1, with -O2 -m64.
4.2.5 seems good.

In gmpxx.h, you can do this change:

struct __gmp_binary_lshift // LINE 425
{
  static void eval(mpz_ptr z, mpz_srcptr w, mp_bitcnt_t l)
  {
    if (__GMPXX_CONSTANT(l) && (l == 0))
    {
      std::cerr << l << '\n'; // NEW LINE
      if (z != w) mpz_set(z, w);
    }

(and replace with at the beginning) and wonder why it prints "2"...

-- 
Marc Glisse

From rguenther at suse.de Mon Feb 20 20:48:41 2012
From: rguenther at suse.de (Richard Guenther)
Date: Mon, 20 Feb 2012 20:48:41 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To:
References:
Message-ID:

On Mon, 20 Feb 2012, Marc Glisse wrote:

> On Mon, 20 Feb 2012, Richard Guenther wrote:
>
> > On Thu, 16 Feb 2012, Marc Glisse wrote:
> >
> > > Hello,
> > >
> > > some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to
> > > a
> > > bug in g++-3.4, which for l=2 says the following is true:
> > > __builtin_constant_p(l) && (l == 0)
> > > it is interesting to insert a printf statement that prints both l and l==0
> > > and
> > > have it print 2 and true :-/
> >
> > Not for me.
> >
> > int main () { int l = 2; if (__builtin_constant_p (l) && (l == 0)) return
> > 1; return 0; }
>
> Did you try getting a snapshot of the gmp repos and running the c++ testsuite?
> Yes, your simple example passes, but you know better than I do how much the
> context matters for optimizations. And since the bug doesn't seem reproducible
> with more recent versions of gcc, there is little motivation to reduce the
> failing tests.
Ah, ok - I thought you might have one. I'm not really interested in
GCC 3.4.x bugs either - after all this version has been out of
maintenance for six years...

> The failing compiler was 3.4.6, from ports on freebsd 8.1, with -O2 -m64.
> 4.2.5 seems good.

... as have 4.2.x and 4.3.x. But it seems freebsd is stuck with 4.2.2,
the last release with GPLv2. I suppose for freebsd testing should focus
on LLVM.

Richard.

From tg at gmplib.org Mon Feb 20 20:59:00 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 20:59:00 +0100
Subject: g++-3.4 bug
In-Reply-To: (Richard Guenther's message of "Mon\, 20 Feb 2012 20\:48\:41 +0100 \(CET\)")
References:
Message-ID: <86ehtp46pn.fsf@shell.gmplib.org>

Richard Guenther writes:

  ... as have 4.2.x and 4.3.x. But it seems freebsd is stuck with 4.2.2,
  the last release with GPLv2. I suppose for freebsd testing should focus
  on LLVM.

I think differently. I am developing GMP on FreeBSD systems, and use gcc
there. I don't waste any time on LLVM.

It is a pity they make this play about GPLv3, but I don't think they'll
get out of their block for many years. In the meantime, one may always
install things from /usr/ports/lang/gcc*.

-- 
Torbjörn

From marc.glisse at inria.fr Mon Feb 20 21:30:38 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Mon, 20 Feb 2012 21:30:38 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To:
References:
Message-ID:

On Mon, 20 Feb 2012, Richard Guenther wrote:

> Ah, ok - I thought you might have one. I'm not really interested in
> GCC 3.4.x bugs either - after all this version has been out of
> maintenance for six years...

Yes, my main concern is whether I should let people notice that the
testsuite is failing so they try a more recent compiler, or work around
it by disabling the use of __builtin_constant_p for 3.4.* (and anything
older?).

> ... as have 4.2.x and 4.3.x. But it seems freebsd is stuck with 4.2.2,
> the last release with GPLv2. I suppose for freebsd testing should focus
> on LLVM.
which last I checked didn't work with the master repos ;-)
(it doesn't understand the instruction jb,pt in
mpn/x86_64/mod_34lsub1.asm (and neither does oracle))

Note that people who accept gplv2 but not gplv3 are fairly likely to be
unhappy with gmp anyway...

-- 
Marc Glisse

From tg at gmplib.org Mon Feb 20 21:49:53 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 21:49:53 +0100
Subject: g++-3.4 bug
In-Reply-To: (Marc Glisse's message of "Mon\, 20 Feb 2012 21\:30\:38 +0100 \(CET\)")
References:
Message-ID: <86aa4d44cu.fsf@shell.gmplib.org>

Marc Glisse writes:

  Yes, my main concern is whether I should let people notice that the
  testsuite is failing so they try a more recent compiler, or work
  around it by disabling the use of __builtin_constant_p for 3.4.* (and
  anything older?).

If just the test suite is miscompiled, and the compiler is actually
still used, then we might as well make a (trivial) workaround in the
test suite.

Note that we resisted the temptation to work around the GCC 4.3.2 bug
that made mpn/generic/rootrem.c become miscompiled. In this case, we
had a compiler which was very much used, but no reasonable workaround
was found.

  which last I checked didn't work with the master repos ;-)
  (it doesn't understand the instruction jb,pt in
  mpn/x86_64/mod_34lsub1.asm (and neither does oracle))

I suppose we could as well remove it. Now done.

-- 
Torbjörn

From marc.glisse at inria.fr Tue Feb 21 21:31:00 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Tue, 21 Feb 2012 21:31:00 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To: <86aa4d44cu.fsf@shell.gmplib.org>
References: <86aa4d44cu.fsf@shell.gmplib.org>
Message-ID:

On Mon, 20 Feb 2012, Torbjorn Granlund wrote:

> Marc Glisse writes:
>
> Yes, my main concern is whether I should let people notice that the
> testsuite is failing so they try a more recent compiler, or work
> around it by disabling the use of __builtin_constant_p for 3.4.* (and
> anything older?).
>
> If just the test suite is miscompiled,

libgmp.* and libgmpxx.* seem fine.

> and the compiler is actually still used, then we might as well make a
> (trivial) workaround in the test suite.

I am not sure what you mean. Note that as is, a g++34 user who
multiplies an mpz_class by 4u (what the testsuite does) in his code is
likely to hit the bug. I am not really happy with hiding the testsuite
failure, I'd rather either let the testsuite noisily fail, or
(trivially) work around the bug in gmpxx.h so the user's code is safe
too.

(now that I think about it, there is a testsuite failure on solaris
(likely with g++-3.4.3) visible at http://hydra.nixos.org/build/2112917
when creating an mpz_class from an mpz_t, 3.4 is really an unlucky
number)

> which last I checked didn't work with the master repos ;-)
> (it doesn't understand the instruction jb,pt in
> mpn/x86_64/mod_34lsub1.asm (and neither does oracle))
>
> I suppose we could as well remove it. Now done.

Thanks.

-- 
Marc Glisse

From nisse at lysator.liu.se Tue Feb 21 21:52:36 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 21 Feb 2012 21:52:36 +0100
Subject: About mpn_sqr_basecase
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Mon, 20 Feb 2012 11:59:55 +0100")
References: <86linxc06h.fsf@shell.gmplib.org>
Message-ID:

I wrote:

>> 2. One could do the shifting differently, applying the shift to the limb
>> argument of addmul_1.

Torbjörn:

>> And then handle carry-out from the last shift as a conditional add_n.

I:

> It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
> n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
> off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
> shift up[n-2], we only need a conditional add of the single limb
> up[n-1].

It turns out there are more conditional adds than I thought. When we
call addmul_1 with a shifted limb, the next addmul_1 is shorter.
So adding the carry to the next addmul_1 call is *almost* enough, but we
also have to apply it to the limb not included in that next addmul_1
call. It boils down to a conditional add to the next diagonal product.

Implementation below. It's quite nice and simple, a single loop which in
each iteration computes one diagonal term, calls addmul_1c for
off-diagonal terms (that's the bulk of the work, naturally), and then a
bit of extra fiddling to get the shifting correct.

Whether this algorithm can be faster than the current mpn_addmul_1-based
code is less clear. It's the O(n) work of MPN_SQR_DIAG_ADDLSH1 compared
to the additional O(n) work done for on-the-fly shifting. One could hope
that the additional shift and carry logic can be scheduled in parallel
with the multiplication work in (the previous) call of addmul_1c.

What's the bottleneck for addmul_1, is it multiplier throughput, carry
propagation latency, or decoder bandwidth?

Regards,
/Niels

#define sqr_2(r3, r2, r1, r0, u1, u0) do {                      \
    mp_limb_t __p1, __p0, __t, __cy, __u1p;                     \
    umul_ppmm (__t, (r0), (u0), (u0));                          \
    umul_ppmm (__p1, __p0, (u0) << 1, (u1));                    \
    __cy = (u0) >> (GMP_LIMB_BITS - 1);                         \
    add_ssaaaa (__t, (r1), __p1, __p0, 0, __t);                 \
    __u1p = (u1) + __cy;                                        \
    __cy = __u1p < __cy;                                        \
    umul_ppmm (__p1, __p0, (u1), __u1p);                        \
    add_ssaaaa ((r3), (r2), __p1, __p0, -__cy, __t);            \
  } while (0)

/* Squaring with on-the-fly shifting for the off-diagonal elements.

   Let

     uc[i] = up[i] >> (GMP_LIMB_BITS - 1)
     us[0] = 2 up[0] mod B,
     us[i] = 2 up[i] mod B + uc[i-1], for i > 0

   In the first iteration, we compute

     up[0] * up[0] + B us[0] *

   In iteration i, we add in

     B^{2i} (up[i] * (up[i] + uc[i-1]) + B us[i] * )

   We have an unlikely carry from the addition up[i] + uc[i-1]. The
   current handling is a bit tricky. A simpler alternative is to
   compute the product up[i]^2, and conditionally add in up[i] to the
   result.
*/
void
sqr_basecase_1 (mp_ptr rp, mp_srcptr up, mp_size_t n)
{
  mp_size_t i;
  mp_limb_t ul, ulp, uh, p2, p1, p0, c1, c0, t;

  if (n == 1)
    {
      mp_limb_t ul = up[0];
      umul_ppmm (rp[1], rp[0], ul, ul);
      return;
    }
  else if (n == 2)
    {
      mp_limb_t u0, u1;
      u0 = up[0];
      u1 = up[1];
      sqr_2 (rp[3], rp[2], rp[1], rp[0], u1, u0);
      return;
    }

  ul = up[0];
  umul_ppmm (p1, rp[0], ul, ul);
  rp[n] = mpn_mul_1c (rp+1, up+1, n-1, ul<<1, p1);

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);
      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
                               (ul << 1) + c0, p1);
    }

  /* Handle i = n-2 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ul = up[n-2];
  ulp = ul + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, ul, ulp);
  add_ssaaaa (p1, rp[2*n-4], p1, p0, -c1, rp[2*n-4]);
  uh = up[n-1];
  umul_ppmm (p2, p0, (ul << 1) + c0, uh);
  ADDC_LIMB (c0, t, p1, rp[2*n-3]);
  add_ssaaaa (p2, rp[2*n-3], p2, p0, c0, t);

  /* Handle i = n-1 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ulp = uh + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, uh, ulp);
  add_ssaaaa (rp[2*n-1], rp[2*n-2], p1, p0, -c1, p2);
}

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se Wed Feb 22 11:18:52 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 11:18:52 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <868vjxbvxl.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Mon, 20 Feb 2012 12:12:22 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> But our multiplication loops for Sandybridge are much worse. The
> current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l. I
> have new code running at 3.4 c/l and 3.3 c/l respectively.

Let me think aloud for a bit. I'm still not sure what the addmul_1
bottleneck is.
I note that in the general x86_64 and core2 code, for each mul, we do
two *dependent* adds before storing the low limb. With the cin and cout
for the carry limbs, the operations per limb are basically

        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     %rax, cin
        adc     $0, cout
        add     cin, (rp)
        adc     %rdx, cout

The real code is a lot more clever with unrolling and scheduling, but if
I read it correctly, these are the operations done.

One could reorder this as follows,

        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     (rp), cin
        adc     $0, cout
        add     %rax, cin
        adc     %rdx, cout
        mov     cin, (rp)

Then we get a bit more independence of operations, since the first add
is independent of the result of the mul. It costs an instruction with
the extra mov to rp. But I suspect that's still better than updating
memory twice, like

        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     cin, (rp)
        adc     $0, cout
        add     %rax, (rp)
        adc     %rdx, cout

The recurrency depth seems to be the same in all cases, though, with a
latency of add + adc + adc from cin to cout. If that's what's killing
performance, maybe this would be better,

        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     (rp), %rax
        adc     %rdx, cout
        add     %rax, cin
        adc     $0, cout
        mov     cin, (rp)

with a recurrency latency of only add + adc (where the adc in question
has a $0 source operand). If I understand you correctly, that would be
only two cycles on sandybridge. It seems the current sbr code does
something similar? Then we have to rely on scheduling or out-of-order
execution to not get a useless wait between the mul and the first add.

Is instruction issue still limited to three instructions per cycle? Then
that's a more narrow bottleneck than both latency and multiplication
throughput. The above is eight instructions per limb, excluding looping.
So one could at least *hope* to get down to 3 cycles with a moderate
level of unrolling.
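
As a rough C model of the last sequence above (add (rp) into the low
product limb, fold the high limb into cout, then a single add + adc $0 on
the cin-to-cout chain), here is a minimal sketch. It assumes 32-bit limbs
so that a 64-bit type can hold a full product, and ref_addmul_1 is a
hypothetical name, not GMP's mpn_addmul_1. It only models the arithmetic
and says nothing about cycle counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not GMP's mpn_addmul_1.
   Assumption: 32-bit limbs, so a uint64_t holds a full limb product.
   Computes {rp, n} += {up, n} * v and returns the carry-out limb.  */
uint32_t
ref_addmul_1 (uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
  uint32_t cin = 0;             /* carry limb entering this position */
  size_t i;

  assert (n > 0);
  for (i = 0; i < n; i++)
    {
      uint64_t p = (uint64_t) up[i] * v;        /* mul v */
      uint32_t lo = (uint32_t) p;               /* plays the role of %rax */
      uint32_t cout = (uint32_t) (p >> 32);     /* plays the role of %rdx */

      lo += rp[i];              /* add (rp), %rax */
      cout += lo < rp[i];       /* carry of that add goes into cout */
      lo += cin;                /* add %rax, cin: first use of cin */
      cout += lo < cin;         /* adc $0, cout */

      rp[i] = lo;               /* mov cin, (rp) */
      cin = cout;               /* cout becomes the next limb's cin */
    }
  return cin;
}
```

The two carry increments cannot overflow cout, since up[i] * v + rp[i] +
cin always fits in two limbs. Only the last add and its carry depend on
cin, which is the short recurrency the sequence is after.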
Maybe one could try a variant updating (rp) twice, to save an
instruction:

        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     %rax, (rp)
        adc     %rdx, cout
        add     cin, (rp)
        adc     $0, cout

Is it possible to squeeze those three memory instructions in less than 3
cycles?

If we try to get the instruction count below 9 (like above), then
there's not much room for moving around %rax and %rdx. But the critical
recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
leaving a bit of freedom to move it between iterations if we unroll and
use multiple registers for cin and cout.

Hmm, let me give one more variant, which moves %rax and %rdx out of the
way, at the cost of one more instruction (8 instructions per limb):

        mov     (up), %rax
        mul     v
        mov     %rax, pl
        mov     %rdx, cout
        add     pl, (rp)
        adc     $0, cout
        add     cin, (rp)
        adc     $0, cout

I guess one ought to bring out the loop mixer to find out if any of this
really can run at 3 cycles.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Wed Feb 22 16:30:55 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 16:30:55 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 11\:18\:52 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
Message-ID: <86sji2khqo.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  The recurrency depth seems to be the same in all cases, though, with a
  latency of add + adc + adc from cin to cout. If that's what's killing
  performance, maybe this would be better,

The x86_64/mul_1.asm code has adc+adc with register operands on the
recurrency chain, i.e., 2 cycles on any AMD chip and 4 cycles on any
Intel chip (except Pentium4 which has about 20 cycles).
        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     (rp), %rax
        adc     %rdx, cout
        add     %rax, cin
        adc     $0, cout
        mov     cin, (rp)

  with a recurrency latency of only add + adc (where the adc in question
  has a $0 source operand). If I understand you correctly, that would be
  only two cycles on sandybridge. It seems the current sbr code does
  something similar? Then we have to rely on scheduling or out-of-order
  execution to not get a useless wait between the mul and the first add.

I think your code has indeed a 2c recurrency path. It is similar to the
sandybridge code, except that it turned out to be better to use adc $0
in both places possible.

  Is instruction issue still limited to three instructions per cycle?
  Then that's a more narrow bottleneck than both latency and
  multiplication throughput. The above is eight instructions per limb,
  excluding looping. So one could at least *hope* to get down to 3
  cycles with a moderate level of unrolling.

I don't think Intel's processors like "op mem,reg" very much. I tried
the code above (with the loop mixer) and it runs slower than the code I
checked in (which runs slightly faster than claimed, 3.25 c/l, not 3.4
c/l).

We still have 3 insn/c, except that some adjacent insn pairs are fused.
I don't know exactly which, but test+jcc and cmp+jcc might be the only
ones. My new addmul_1 has 38 insn, no fusing expected so 3.17 c/l is a
decoder imposed limit.

  Maybe one could try a variant updating (rp) twice, to save an
  instruction:

        mov     (up), %rax
        mul     v
        xor     cout, cout
        add     %rax, (rp)
        adc     %rdx, cout
        add     cin, (rp)
        adc     $0, cout

  Is it possible to squeeze those three memory instructions in less than
  3 cycles?

Intel processors like "op reg,mem" even less...

  If we try to get the instruction count below 9 (like above), then
  there's not much room for moving around %rax and %rdx.
  But the critical recurrency, "add cin, ...; adc $0, cout", doesn't
  involve %rax and %rdx, leaving a bit freedom to move it between
  iterations if we unroll and use multiple registers for cin and cout.

  Hmm, let me give one more variant, which moves %rax and %rdx out of
  the way, at the cost of one more instruction (8 instructions per
  limb):

        mov     (up), %rax
        mul     v
        mov     %rax, pl
        mov     %rdx, cout
        add     pl, (rp)
        adc     $0, cout
        add     cin, (rp)
        adc     $0, cout

  I guess one ought to bring out the loop mixer to find out if any of
  this really can run at 3 cycles.

The sandybridge machine tom behind shell is waiting for you. A nehalem
machine is biko* (lots of Xen machines on the same hardware) and
repentium is a Conroe ("core2").

I recently played with karatsuba-based addmul_2 for s390. For MUL-
challenged machines, that might be an option. Sandybridge is OK there,
but Bull-dozer is not. A disadvantage is that such code would be
unsuitable for code which wants to be "side-channel silent".

-- 
Torbjörn

From tg at gmplib.org Wed Feb 22 17:03:54 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 17:03:54 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 11\:18\:52 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
Message-ID: <86ipiykg7p.fsf@shell.gmplib.org>

I doubt we can make addmul_1 run faster on sandybridge. But I'd like
mul_basecase to run much faster than 3 c/l. Then sqr_basecase and
redc_1, redc_2 should be fixed.

An addmul_2 running at 3 c/l or better would be great. That means we
need to handle a "tick" in it using <= 17 insns, probably avoiding
"op reg,mem" and "op mem,reg". (If we use 18 insn, loop handling will
bring us over 3 c/l.)

For mul_basecase and sqr_basecase we could perhaps work vertically,
summing into 3 registers. I.e., pretend we really multiply polynomials,
performing no (recurrency) carry propagation until we reach the bottom.
I haven't tried this, but I think this might be really promising for
Intel's last *two* main generations (Nehalem/Westmere and
Sandybridge/Ivybridge). Perhaps we could get close to 2 c/l with this
approach, unless register shortage messes things up too badly.

I haven't thought about doing redc_1/redc_2 using this approach. Hensel
lifting on-the-fly could be interesting...

-- 
Torbjörn

From nisse at lysator.liu.se Wed Feb 22 19:16:23 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 19:16:23 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86sji2khqo.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Wed, 22 Feb 2012 16:30:55 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> I recently played with karatsuba-based addmul_2 for s390. For MUL-
> challenged machines, that might be an option. Sandybridge is OK there,
> but Bull-dozer is not. A disadvantage is that such code would be
> unsuitable for code which wants to be "side-channel silent".

I'm pretty sure one can do Karatsuba branchfree. Not sure one can do
that and do it fast (but if the alternative is branches, they're costly
too).

Evaluation:

        mov     u0, a
        sub     u1, a
        lea     (u0, u1), t
        cmovc   t, a
        sbb     mask, mask

Now a = |u0 - u1|, with sign in mask. Can easily be done also without
cmov, but with a longer chain of dependent instructions:

        mov     u0, a
        sub     u1, a
        sbb     mask, mask
        xor     mask, a
        sub     mask, a

Get b = |v0 - v1| in the same way, and arrange so that the final mask is
all ones if the term |u0 - u1| * |v0 - v1| should be subtracted, i.e.,
if (u0 - u1)(v0 - v1) >= 0.

Interpolation: Add in the signed term (u0 - u1) * (v0 - v1) to the three
limbs , using two's complement:

        mov     a, %rax
        mul     b
        xor     mask, %rax
        xor     mask, %rdx
        bt      $0, mask        C Set carry from mask. Any better way?
        C + c = - (u0 - u1) (v0 - v1), in two's complement.
        adc     %rax, r1
        adc     %rdx, r2
        adc     mask, r3

Alternatively, one could use (u0 + u1) * (v0 + v1), with a couple of
conditional adds instead.

/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From marc.glisse at inria.fr Wed Feb 22 19:18:58 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Wed, 22 Feb 2012 19:18:58 +0100 (CET)
Subject: _mp_alloc vs ALLOC
Message-ID:

Hello,

is there any objection if I replace most uses of ->_mp_alloc by calls to
the ALLOC macro in mp[zqf] (and similarly for _mp_size, etc)? It helps
when experimenting... I am also considering moving the NUM and DEN
macros from test/mpq/t-cmp* to gmp-impl.h, since I assume mpq_numref and
mpq_denref are not used much internally because of their length. By the
way, is there any difference between PTR and LIMBS? Say one that should
be used in some circumstances and one in others?

Unrelated, I was thinking of changing (when gmp is compiled with a C++
compiler, so that wouldn't affect many people...) the definitions of
TMP_DECL and TMP_FREE so TMP_DECL would create a variable whose
destructor (executed when the variable goes out of scope, which
shouldn't be far from where TMP_FREE is currently called) does what
TMP_FREE currently does. The advantage is that in case an exception is
thrown in between, the destructor is executed. That doesn't solve all
memory issues by far, but it is a first step that costs little in terms
of code and 0 in speed.

I am not saying I will do either any time soon, just checking first.
-- 
Marc Glisse

From tg at gmplib.org Wed Feb 22 19:41:17 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 19:41:17 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: (Marc Glisse's message of "Wed\, 22 Feb 2012 19\:18\:58 +0100 \(CET\)")
References:
Message-ID: <86d396soc2.fsf@shell.gmplib.org>

Marc Glisse writes:

  is there any objection if I replace most uses of ->_mp_alloc by calls
  to the ALLOC macro in mp[zqf] (and similarly for _mp_size, etc)? It
  helps when experimenting... I am also considering moving the NUM and
  DEN macros from test/mpq/t-cmp* to gmp-impl.h, since I assume
  mpq_numref and mpq_denref are not used much internally because of
  their length. By the way, is there any difference between PTR and
  LIMBS? Say one that should be used in some circumstances and one in
  others?

You're welcome to clean up this. The macro LIMBS is used in just one
file, AFAICT, I have no idea why it exists.

  Unrelated, I was thinking of changing (when gmp is compiled with a C++
  compiler, so that wouldn't affect many people...) the definitions of
  TMP_DECL and TMP_FREE so TMP_DECL would create a variable whose
  destructor (executed when the variable goes out of scope, which
  shouldn't be far from where TMP_FREE is currently called) does what
  TMP_FREE currently does. The advantage is that in case an exception is
  thrown in between, the destructor is executed. That doesn't solve all
  memory issues by far, but it is a first step that costs little in
  terms of code and 0 in speed.

That'd be fine too.

-- 
Torbjörn

From bodrato at mail.dm.unipi.it Wed Feb 22 20:28:52 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Wed, 22 Feb 2012 20:28:52 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86d396soc2.fsf@shell.gmplib.org>
References: <86d396soc2.fsf@shell.gmplib.org>
Message-ID: <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>

Ciao,

On Wed, 22 February 2012 7:41 pm, Torbjorn Granlund wrote:
> Marc Glisse writes:
> their length.
> By the way, is there any difference between PTR and
> LIMBS? Say one that should be used in some circumstances and one in
> others?
>
> You're welcome to clean up this. The macro LIMBS is used in just one
> file, AFAICT, I have no idea why it exists

I suspect there are other unused macros hanging in gmp-impl.h ...

> Unrelated, I was thinking of changing (when gmp is compiled with a C++
...
> TMP_DECL and TMP_FREE so TMP_DECL would create a variable whose

Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2. I mean
_3 and _4. So that they can be used to reduce the number of allocations.
Do you agree? (I just touched mpz/gcdext.c, and _4 should be used there).

Regards,
m

-- 
http://bodrato.it/

From tg at gmplib.org Wed Feb 22 20:32:18 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 20:32:18 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> (bodrato@mail.dm.unipi.it's message of "Wed\, 22 Feb 2012 20\:28\:52 +0100 \(CET\)")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
Message-ID: <86zkcar7el.fsf@shell.gmplib.org>

bodrato at mail.dm.unipi.it writes:

  Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I
  mean _3 and _4. So that they can be used to reduce the number of
  allocations. Do you agree? (I just touched mpz/gcdext.c, and _4 should
  be used there).

I'd vote for killing TMP_ALLOC_LIMBS_2 rather than add TMP_ALLOC_LIMBS_N
for some range of N.

Please look at the generated code from TMP_ALLOC from any reasonable
compiler. It is a sub from sp, then a copy from sp to the target
variable. Cost: about 1 cycle.

TMP_ALLOC_LIMBS_2 is clutter IMHO.
-- 
Torbjörn

From nisse at lysator.liu.se Wed Feb 22 20:57:31 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 20:57:31 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86zkcar7el.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Wed, 22 Feb 2012 20:32:18 +0100")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> TMP_ALLOC_LIMBS_2 is clutter IMHO.

Sure, it's pointless in a normal build.

As I understand it, the reason for having TMP_ALLOC_LIMBS_2 is to make
--enable-alloca=debug more effective, by getting some kind of red zone
separating the two areas. Whether or not that's worth the clutter, I'm
not sure.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Wed Feb 22 21:02:55 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 21:02:55 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 20\:57\:31 +0100")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org>
Message-ID: <86pqd6r5zk.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund writes:

  > TMP_ALLOC_LIMBS_2 is clutter IMHO.

  Sure, it's pointless in a normal build.

  As I understand it, the reason for having TMP_ALLOC_LIMBS_2 is to make
  --enable-alloca=debug more effective, by getting some kind of red zone
  separating the two areas. Whether or not that's worth the clutter, I'm
  not sure.

Surely a plain TMP_ALLOC adds red zones? If not, that is something we
ought to fix.
-- 
Torbjörn

From marc.glisse at inria.fr Wed Feb 22 21:20:23 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Wed, 22 Feb 2012 21:20:23 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To:
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org>
Message-ID:

On Wed, 22 Feb 2012, Torbjorn Granlund wrote:

> bodrato at mail.dm.unipi.it writes:
>
> Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I mean
> _3 and _4. So that they can be used to reduce the number of allocations.
> Do you agree? (I just touched mpz/gcdext.c, and _4 should be used there).
>
> I'd vote for killing TMP_ALLOC_LIMBS_2 rather than add TMP_ALLOC_LIMBS_N
> for some range of N.
>
> Please look at the generated code from TMP_ALLOC from any reasonable
> compiler. It is a sub from sp, then a copy from sp to the target
> variable. Cost: about 1 cycle.

That's for the alloca case. Without alloca, one call to malloc is better
than two (although that usually also means the numbers are big and any
gmp operation will dwarf allocation). Also, the threshold between alloca
and malloc is quite high, and with many separate allocations that all
barely fit below this threshold, the total amount of stack memory used
can become too large for some applications (lowering the threshold may
be easier than allocating things in groups though).

On Wed, 22 Feb 2012, Niels Möller wrote:

> Torbjorn Granlund writes:
>
>> TMP_ALLOC_LIMBS_2 is clutter IMHO.
>
> Sure, it's pointless in a normal build.
>
> As I understand it, the reason for having TMP_ALLOC_LIMBS_2 is to make
> --enable-alloca=debug more effective, by getting some kind of red zone
> separating the two areas. Whether or not that's worth the clutter, I'm
> not sure.

Er, I guess you mean TMP_ALLOC_LIMBS_2 as opposed to a single call to
TMP_ALLOC_LIMBS manually split in two, not as opposed to 2 calls to
TMP_ALLOC_LIMBS.
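
To illustrate the red-zone idea in isolation, here is a minimal sketch,
assuming a debug build that carves two areas out of one block and puts a
guard limb after each. The names dbg_alloc_limbs_2, dbg_check_limbs_2 and
REDZONE_LIMB are hypothetical, and this is not how GMP's
TMP_ALLOC_LIMBS_2 or --enable-alloca=debug is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define REDZONE_LIMB 0xDEADBEEFu        /* hypothetical guard pattern */

/* Hypothetical debug helper, not GMP's TMP_ALLOC_LIMBS_2.
   Allocates areas of an and bn limbs from one block, laid out as
   [a: an limbs][guard limb][b: bn limbs][guard limb].
   Returns the block (to pass to free), or NULL on failure.  */
uint32_t *
dbg_alloc_limbs_2 (uint32_t **ap, size_t an, uint32_t **bp, size_t bn)
{
  uint32_t *block;

  assert (ap != NULL && bp != NULL);
  block = malloc ((an + bn + 2) * sizeof (uint32_t));
  if (block == NULL)
    return NULL;
  *ap = block;
  block[an] = REDZONE_LIMB;             /* separates a from b */
  *bp = block + an + 1;
  block[an + 1 + bn] = REDZONE_LIMB;    /* catches overruns of b */
  return block;
}

/* Returns nonzero if both guard limbs are still intact.  */
int
dbg_check_limbs_2 (const uint32_t *block, size_t an, size_t bn)
{
  return block[an] == REDZONE_LIMB && block[an + 1 + bn] == REDZONE_LIMB;
}
```

With a single allocation split by hand there would be no guard between
the two areas, so a one-limb overrun of the first area would silently
land in the second. With the guard limb, a later check can notice it.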
-- 
Marc Glisse

From tg at gmplib.org Wed Feb 22 21:24:06 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 21:24:06 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: (Marc Glisse's message of "Wed\, 22 Feb 2012 21\:20\:23 +0100 \(CET\)")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org>
Message-ID: <86linur509.fsf@shell.gmplib.org>

Marc Glisse writes:

  That's for the alloca case. Without alloca, one call to malloc is better
  than two (although that usually also means the numbers are big and any
  gmp operation will dwarf allocation). Also, the threshold between alloca
  and malloc is quite high, and with many separate allocations that all
  barely fit below this threshold, the total amount of stack memory used
  can become too large for some applications (lowering the threshold may
  be easier than allocating things in groups though).

I don't buy this argument.

If the threshold is high, then surely the malloc time will not take a
significant fraction of the total time.

If the threshold is too high, then we should lower it.

Is there no good range for the threshold? Show me the numbers...!

-- 
Torbjörn

From nisse at lysator.liu.se Wed Feb 22 22:36:39 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Wed, 22 Feb 2012 22:36:39 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86sji2khqo.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Wed, 22 Feb 2012 16:30:55 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org>
Message-ID: 

Torbjorn Granlund writes:

> The sandybridge machine tom behind shell is waiting for you. A nehalem
> machine is biko* (lots of Xen machines on the same hardware) and
> repentium is a Conroe ("core2").

The best I find is

    mov   (up, n, 8), %rax
    mul   v
    mov   %rdx, c1
    add   (rp, n, 8), %rax
    adc   $0, c1
    add   %rax, c0
    adc   $0, c1
    mov   c0, (rp, n, 8)

Unrolled four times, that's 34 instructions.
The best result from the loop mixer so far has been 3.24 cycles / limb.
See shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms. Which is the same as
your code, I guess.

A variant with one more instruction to move away %rax, at
shell:~nisse/hack/loopmix/lms/addmul_1.nlms, seems to run at 3.52.

A variant with 7 instructions, but two add reg, mem operations, is also
slow.

Regards,
/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Thu Feb 23 07:51:26 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 07:51:26 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels Möller"'s message of "Wed\, 22 Feb 2012 22\:36\:39 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org>
Message-ID: <86fwe2qbyp.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  The best I find is

      mov   (up, n, 8), %rax
      mul   v
      mov   %rdx, c1
      add   (rp, n, 8), %rax
      adc   $0, c1
      add   %rax, c0
      adc   $0, c1
      mov   c0, (rp, n, 8)

  Unrolled four times, that's 34 instructions. The best result from the
  loop mixer so far has been 3.24 cycles / limb. See
  shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms. Which is the same as your
  code, I guess.

My code looked like 3.16 in the loopmixer, then runs at 3.25 outside of
it. If you can make your smaller code actually run at 3.25, it is an
improvement.

I think we should focus not on addmul_1 but on mul_basecase,
sqr_basecase, redc_1, perhaps redc_2. I.e., please focus on addmul_2
(or addmul_N, N > 2) or vertical multiplication primitives.
-- 
Torbjörn

From nisse at lysator.liu.se Thu Feb 23 09:53:54 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Thu, 23 Feb 2012 09:53:54 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86fwe2qbyp.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Thu, 23 Feb 2012 07:51:26 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org>
Message-ID: 

Torbjorn Granlund writes:

> I think we should focus not on addmul_1 but on mul_basecase,
> sqr_basecase, redc_1, perhaps redc_2. I.e., please focus on addmul_2
> (or addmul_N, N > 2) or vertical multiplication primitives.

Here's a sketch of an addmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.

    mov     (up), %rax
    mov     %rax, ul
    mul     vl              C low product
    mov     8(up), uh
    mov     %rax, r0
    mov     %rdx, r1
    lea     (uh, ul), %rax
    sub     uh, ul
    cmovnc  ul, %rax
    sbb     r3, r3
    mul     vd              C Middle product
    mov     r1, r2          C Add shifted low product
    add     r0, r1
    adc     $0, r2
    add     (rp), r0        C Add rp limbs
    adc     8(rp), r1
    adc     $0, r2
    mov     %rax, p0
    mov     %rdx, p1
    mov     uh, %rax
    mul     vh              C High product
    xor     vmask, r3
    xor     r3, p0          C Conditionally negate, and add, middle product
    xor     r3, p1
    bt      $0, r3
    adc     p0, r1
    adc     p1, r2
    adc     $0, r3
    add     %rax, r1        C Add shifted high product
    adc     %rdx, r2
    adc     $0, r2
    add     c0, r0          C Add input carry limbs
    adc     c1, r1
    mov     c0, (rp)
    mov     c1, 8(rp)
    adc     %rax, r2        C Add high product
    adc     %rdx, r3

37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).

I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.

* At least the moves to p0, p1 can be eliminated.
* One could also save some instructions from adding in c0, c1 earlier,
  and doing an in-place add to (rp) at the end, on the theory that the
  recurrency is less tight.

* I'm also not sure if the order of the three multiplications is the
  best one.

* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
  (if additions are organized in the right way) is done twice. I
  suspect it's going to be a bit painful since the carry has to be
  applied at two places.

What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.

For a "vertical" mul_basecase, the quadratic work would be an iteration
of

    mov   (up), %rax
    mul   (vp)
    add   %rax, r0
    adc   %rax, r1
    adc   $0, r3

So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carry
propagation, and possible bad branch-prediction due to loops of varying
lengths.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Thu Feb 23 11:13:30 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 11:13:30 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels Möller"'s message of "Thu\, 23 Feb 2012 09\:53\:54 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org>
Message-ID: <86d395kgc5.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Here's a sketch of an addmul_2 iteration using Karatsuba. I assume we
  have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
  before the loop. Carry input in c0, c1, carry out in r2, r3.
      mov     (up), %rax
      mov     %rax, ul
      mul     vl              C low product
      mov     8(up), uh
      mov     %rax, r0
      mov     %rdx, r1
      lea     (uh, ul), %rax
      sub     uh, ul
      cmovnc  ul, %rax
      sbb     r3, r3
      mul     vd              C Middle product
      mov     r1, r2          C Add shifted low product
      add     r0, r1
      adc     $0, r2
      add     (rp), r0        C Add rp limbs
      adc     8(rp), r1
      adc     $0, r2
      mov     %rax, p0
      mov     %rdx, p1
      mov     uh, %rax
      mul     vh              C High product
      xor     vmask, r3
      xor     r3, p0          C Conditionally negate, and add, middle product
      xor     r3, p1
      bt      $0, r3
      adc     p0, r1
      adc     p1, r2
      adc     $0, r3
      add     %rax, r1        C Add shifted high product
      adc     %rdx, r2
      adc     $0, r2
      add     c0, r0          C Add input carry limbs
      adc     c1, r1
      mov     c0, (rp)
      mov     c1, 8(rp)
      adc     %rax, r2        C Add high product
      adc     %rdx, r3

  37 instructions, or 12.25 instructions per limb, excluding looping logic
  (and it has to be unrolled twice, to use separate registers for input
  and output carries).

How did you arrive at 12.25 insns/limb? I have not tried to understand
the code, but doesn't it perform a 2x2 limb multiply with accumulation?
That's 9.25 insn/limb product.

I very much doubt this will win for Sandybridge, unless you can decrease
the insn count by several instructions.

Unfortunately it has no chance on Bulldozer, since the latter has a 2
issue pipeline; you need to beat 32 insns per 2x2 accumulation block
there.

  I think the instruction count can be reduced a bit, at the cost of
  higher pressure on the out-of-order execution.

Perhaps some of the adc $0 could be eliminated with 2x unrolling?

  What do you think? If one can get one iteration to run at 12 cycles,
  that's 3 c/l and an improvement over addmul_1. If one can get it down to
  11 or 11.5, one beats 3 c/l.

If that is possible, it might not be enough... See below.

  For a "vertical" mul_basecase, the quadratic work would be an iteration
  of

      mov   (up), %rax
      mul   (vp)
      add   %rax, r0
      adc   %rax, r1
      adc   $0, r3

  So there's potential for that to run at 2 cycles per limb product.
  But then there's also a significant linear cost for accumulation and
  carry propagation, and possible bad branch-prediction due to loops of
  varying lengths.

Exactly. (But the branch misprediction problem would not happen for
David's mulmid_basecase, I suppose.)

Some ways to deal with the branch misprediction problem:

* Have straight line code for the corners, up to a limit. This gets
  rid of the really high relative branch misprediction for these small
  areas.

* Handle two (or more) columns in parallel, and separately for the
  low-significant right triangle, any middle rectangular part, and the
  left triangle. This doubles (or more) the amount of useful work per
  branch misprediction.

* I suppose that making full use of out-of-order execution just before
  a mispredicted branch would make sense.

I played a bit with mul_2 yesternight. I am not 100% sure the code is
correct, but I think it is. The loopmixer found a 2.5 c/l version of
it.

I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul". I then analysed the critical path and determined that
it is about 24. The problem is adc feeding other adc feeding other adc
through a register (remember that pure carry deps are fast on
Sandybridge). So I mindlessly introduced 4 new registers, then using
alternating accumulation registers. I needed 8 extra insn (in total,
corresponding to 2 per way) to pairwise sum accumulation registers. I
am sure this was not done optimally, but (assuming my code is sound) it
proves that there is a lot of performance headroom, as expected.

I conjecture that we could create an addmul_N for Sandybridge that runs
at <= 2.5 c/l. I think this will be possible already for N=2. Perhaps
we could arrive at 2.25 for N=4, matching the K8-K10 performance.
-- 
Torbjörn

From nisse at lysator.liu.se Thu Feb 23 16:09:56 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Thu, 23 Feb 2012 16:09:56 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86d395kgc5.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Thu, 23 Feb 2012 11:13:30 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org> <86d395kgc5.fsf@shell.gmplib.org>
Message-ID: 

Torbjorn Granlund writes:

> How did you arrive to 12.25 insns/limb?

I did something wrong, I guess... 9.25 instructions / limb or 3.1 cycles
if we can issue 3 instructions per cycle.

> I very much doubt this will win for Sandybridge, unless you can decrease
> the insn count with several instructions.

One can decrease it a bit by adding c0, c1 earlier (do you think
recurrency can be a problem if we add c0, c1 to the first product?) and
doing an in-place add to (rp) and 8(rp) at the end. I could get it down
to 30 instructions with a deep carry recurrency, or 34 with a short one.
I can get neither variant to run faster than 4 c/l.

I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).
It's about the same number of instructions, but the updates from carries
are independent of all products, so there's more freedom in where to
move them around. I think this idea may be more useful for other
processors, without the awkward hardwired mul registers.
For documentation, this is what the iteration should do:

  vd = |v1 - v0|, sign in vs   (outside loop)
  ud = |u1 - u0|, sign in us
  s = us ^ vs ^ 1
  <p3, p2> = u1 * v1
  <m1, m0> = ud * vd ^ <-s, -s>
  <p1, p0> = u0 * v0

  +-----+-----+
  |p3 p2|p1 p0|
  +-----+--+--+
        |c1 c0|
        +-----+
        |r1 r0|
     +--+--+--+
     |p1 p0|
     +-----+
     |p3 p2|
  +--+--+--+
  |-s| 0| s|
  +--+--+--+
     |m1 m0|
--+--+-----+--+
  |c3 c2 r1 r0|
  +-----------+

or

  <vc, vs> = v1 + v0   (outside loop)
  <uc, us> = u1 + u0
  <p3, p2> = u1 * v1
  <m1, m0> = us * vs
  <p1, p0> = u0 * v0

  +-----+-----+
  |p3 p2|p1 p0|
  +-----+--+--+
        |c1 c0|
        +-----+
        |r1 r0|
     +--+--+--+
   - |p1 p0|
     +-----+
   - |p3 p2|
     +-----+
     |m1 m0|
  +--+--+--+
  |vc vs|      if uc
  +--+--+
     |us|      if vc
--+--+--+-----+
  |c3 c2 r1 r0|
  +-----------+

> Perhaps some of the adc $0 could be eliminated with 2x unrolling?

In effect, that would be a kind of 4x2 multiply. Which would then be
done as two 2x2 (I think the high limbs one gets from evaluation rule
out using toom32 or toom42). Haven't tried that. I suspect one will run
out of registers.

> I played a bit with mul_2 yesternight. I am not 100% the code is
> correct, but I think it is. The loopmixer found a 2.5 c/l version of
> it.

Nice. I've now wasted quite some time... It seems really difficult.

Now I also tried a very basic variant of addmul_2, doing only one u limb
per iteration and multiplying it by the two v limbs. Even if I have a
very nice carry recurrence between iterations, add, adc, adc $0, four
cycles, and a small number of instructions (15 per iteration, 32 for the
twice unrolled loop), which one might think could be executed in 11
cycles or 5.5 / iteration or 2.75 cycles per limb product. But it won't
run faster than 6.5 cycles per iteration, or 3.25 c/l.

So it just seems very difficult to convince the cpu to really execute
the independent operations, outside of the recurrency, in parallel.

BTW, are any of the SSE3 etc instructions useful here?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
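
The first diagram above (evaluation in -1) can be checked with a small C
model. This is only a sketch under assumptions that are not from the
thread: limbs are 16 bits so that every intermediate fits in standard C
types, the function name kara_2x2 is made up for illustration, and the
sign handling uses a branch rather than the xor-with-mask trick of the
assembly sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply the 2-limb numbers <u1,u0> and <v1,v0> using three limb
   products (Karatsuba, evaluation in -1).  Limbs are 16 bits here,
   which is an assumption made for this sketch only.  */
static uint64_t
kara_2x2 (uint16_t u1, uint16_t u0, uint16_t v1, uint16_t v0)
{
  /* Absolute differences, as in ud = |u1 - u0| and vd = |v1 - v0|.  */
  uint32_t ud = u1 >= u0 ? (uint32_t) u1 - u0 : (uint32_t) u0 - u1;
  uint32_t vd = v1 >= v0 ? (uint32_t) v1 - v0 : (uint32_t) v0 - v1;

  /* True when (u1 - u0) * (v1 - v0) is negative.  */
  int neg = (u1 < u0) != (v1 < v0);

  uint64_t lo = (uint64_t) u0 * v0;   /* <p1,p0> */
  uint64_t hi = (uint64_t) u1 * v1;   /* <p3,p2> */
  uint64_t md = (uint64_t) ud * vd;   /* |u1 - u0| * |v1 - v0| */
  uint64_t mid;

  /* u1*v0 + u0*v1 = hi + lo - (u1 - u0)*(v1 - v0), which is never
     negative, so the subtraction below cannot wrap.  */
  assert (neg || hi + lo >= md);
  mid = neg ? hi + lo + md : hi + lo - md;

  /* Both products are added twice, once in place and once shifted by
     one limb, together with the signed middle product.  */
  return (hi << 32) + (mid << 16) + lo;
}
```

The point of the model is only the arithmetic identity; it says nothing
about instruction counts or scheduling, which is what the thread is
really about.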
From tg at gmplib.org Thu Feb 23 17:44:21 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 17:44:21 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels Möller"'s message of "Thu\, 23 Feb 2012 16\:09\:56 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org> <86d395kgc5.fsf@shell.gmplib.org>
Message-ID: <868vjtijoa.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  One can decrease it a bit by adding c0, c1 earlier (do you think
  recurrency can be a problem if we add c0, c1 to the first product?) and
  doing an in-place add to (rp) and 8(rp) at the end. I could get it down
  to 30 instructions with a deep carry recurrency, or 34 with a short one.
  I can get neither variant to run faster than 4 c/l.

In loopmixer or manually? I wouldn't draw any conclusions without
mixing the code first...

  I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).

Meaning evaluating in +1 instead of -1, I assume.

  It's about the same number of instructions, but the updates from carries
  are independent of all products, so there's more freedom in where to
  move them around. I think this idea may be more useful for other
  processors, without the awkward hardwired mul registers.

True.

  > I played a bit with mul_2 yesternight. I am not 100% the code is
  > correct, but I think it is. The loopmixer found a 2.5 c/l version of
  > it.

  Nice. I've now wasted quite some time... It seems really difficult.

It is challenging, but I am getting convinced we can really speed things
a lot.

  Now I also tried a very basic variant of addmul_2, doing only one u limb
  per iteration and multiplying it by the two v limbs.
  Even if I have a very nice carry recurrence between iterations, add,
  adc, adc $0, four cycles, and a small number of instructions (15 per
  iteration, 32 for the twice unrolled loop), which one might think could
  be executed in 11 cycles or 5.5 / iteration or 2.75 cycles per limb
  product. But it won't run faster than 6.5 cycles per iteration, or
  3.25 c/l.

  So it just seems very difficult to convince the cpu to really execute
  the independent operations, outside of the recurrency, in parallel.

Did you compute the recurrency chain? Annotating the instructions on
the recurrency chain helps understanding the problem.

My experience of Sandybridge is that with load/store coding style, the
CPU typically executes 3 insn/cycle unless there is a recurrency
dependency stopping that.

  BTW, are any of the SSE3 etc instructions useful here?

I don't think there are. These instructions are mostly FP plus narrow
integer ops.

-- 
Torbjörn

From nisse at lysator.liu.se Thu Feb 23 21:00:41 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Thu, 23 Feb 2012 21:00:41 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <868vjtijoa.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Thu, 23 Feb 2012 17:44:21 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org> <86d395kgc5.fsf@shell.gmplib.org> <868vjtijoa.fsf@shell.gmplib.org>
Message-ID: 

Torbjorn Granlund writes:

> In loopmixer or manually? I wouldn't draw any conclusions without
> mixing the code first...

With the loop mixer.

> Meaning evaluating in +1 instead of -1, I assume.

Exactly.

> Did you compute the recurrency chain? Annotating the instructions on
> the recurrency chain helps understanding the problem.

I tried.
I use this iteration

    mov   (up, n, 8), %rax
    mov   %rax, u
    mul   v0
    mov   %rax, r0
    mov   %rdx, r1
    mov   u, %rax
    mul   v1
    mov   (rp, n, 8), t
    add   t, r0
    add   %rax, r1
    mov   %rdx, r2
    adc   $0, r2
    add   c0, r0
    mov   r0, (rp, n, 8)
    adc   c1, r1
    adc   $0, r2

For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
different registers),

    add   c0, r0      C 0   6
    adc   c1, r1      C 1   7
    adc   $0, r2      C 3   9

    add   r1, c2      C 3   9
    adc   r2, c0      C 4  10
    adc   $0, c1      C 6  12

So the recurrency, for one iteration, seems to be just 3 cycles. But the
loop mixer doesn't find anything faster than 6.36 cycles for one
iteration, or 3.18 per limb product. Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier), but
stubbornly above 3 c/l.

> My experience of Sandybridge is that with load/store coding style, the
> CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.

If we could get there, the above loop should run just below 3 c/l.

> I don't think there are. These instructions are mostly FP plus narrow
> integer ops.

Hmm. Last time I looked at that was in a 32-bit context. There's a
32x32->64 instruction which might be useful for a 32-bit build, at least
in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
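
As a reading aid for the iteration above, here is a C reference model of
an addmul_2 loop that handles one u limb per iteration and carries two
limbs between iterations. It is a sketch, not a transcription of the
assembly: it uses 32-bit limbs so that uint64_t can hold a limb product,
the name ref_addmul_2 is hypothetical, and the interface (read n limbs
of rp, write n+1 limbs, return the top limb) is an assumption chosen so
that the result always fits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* 32-bit limbs: an assumption of this sketch */

/* Compute {rp,n} + {up,n} * <v1,v0>.  The low n limbs go to rp[0..n-1],
   the next limb is written to rp[n] (which is not read), and the most
   significant limb is returned.  */
static limb_t
ref_addmul_2 (limb_t *rp, const limb_t *up, size_t n, limb_t v0, limb_t v1)
{
  limb_t c0 = 0, c1 = 0;   /* carry limbs for positions i and i+1 */
  size_t i;

  for (i = 0; i < n; i++)
    {
      uint64_t p0 = (uint64_t) up[i] * v0;   /* low v limb product */
      uint64_t p1 = (uint64_t) up[i] * v1;   /* high v limb product */
      uint64_t t;

      /* Position i: rp limb, low half of p0, low carry limb.  */
      t = (uint64_t) rp[i] + (limb_t) p0 + c0;
      rp[i] = (limb_t) t;

      /* Position i+1: becomes the new low carry limb.  */
      t = (t >> 32) + (p0 >> 32) + (limb_t) p1 + c1;
      c0 = (limb_t) t;

      /* Position i+2: becomes the new high carry limb.  The two-limb
         carry stays below B^2, so this fits in one limb.  */
      assert ((t >> 32) + (p1 >> 32) <= 0xFFFFFFFFu);
      c1 = (limb_t) ((t >> 32) + (p1 >> 32));
    }

  rp[n] = c0;
  return c1;
}
```

The three steps in the loop body correspond to the add, adc, adc $0
recurrence discussed above; everything else in the assembly is
independent of the carry limbs.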
From nisse at lysator.liu.se Thu Feb 23 21:09:32 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Thu, 23 Feb 2012 21:09:32 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels Möller"'s message of "Thu, 23 Feb 2012 21:00:41 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org> <86d395kgc5.fsf@shell.gmplib.org> <868vjtijoa.fsf@shell.gmplib.org>
Message-ID: 

nisse at lysator.liu.se (Niels Möller) writes:

> So the recurrency, for one iteration, seems to be just 3 cycles. But the
> loop mixer doesn't find anything faster than 6.36 cycles for one
> iteration, or 3.18 per limb product. Which isn't too bad (a slight
> improvement over 3.24, which I think is the best reported earlier), but
> stubbornly above 3 c/l.

One update. I have now tried unrolling four times. Then I've seen one
sequence running at 6.16 cycles per iteration, or 3.08 c/l. See
shell:~nisse/hack/loopmix/lms/addmul_2-nisse-2.nlms.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Thu Feb 23 21:38:27 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 21:38:27 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: ("Niels Möller"'s message of "Thu\, 23 Feb 2012 21\:00\:41 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org> <86d395kgc5.fsf@shell.gmplib.org> <868vjtijoa.fsf@shell.gmplib.org>
Message-ID: <864nuhp9oc.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
  Let's write the interesting instructions out and unroll twice (using
  different registers),

      add   c0, r0      C 0   6
      adc   c1, r1      C 1   7
      adc   $0, r2      C 3   9

      add   r1, c2      C 3   9
      adc   r2, c0      C 4  10
      adc   $0, c1      C 6  12

  So the recurrency, for one iteration, seems to be just 3 cycles. But the
  loop mixer doesn't find anything faster than 6.36 cycles for one
  iteration, or 3.18 per limb product. Which isn't too bad (a slight
  improvement over 3.24, which I think is the best reported earlier), but
  stubbornly above 3 c/l.

I am playing with this block:

    carry-in lo in r14
    carry-in hi in rcx

    mov   0(up), %rax
    mul   v1
    mov   8(rp), %r8
    add   %rax, %r8
    mov   %rdx, %r9
    adc   $0, %r9
    mov   8(up), %rax
    mul   v0
    add   %rax, %r8
    adc   %rdx, %r9
    mov   $0, R32(%rbx)
    adc   $0, R32(%rbx)
    add   %r14, %r8           C 0
    adc   %rcx, %r9           C 1
    adc   $0, R32(%rbx)       C might be removed
    mov   %r8, 8(rp)

    carry-out lo in r9
    carry-out hi in rbx

This is not identical to your block, I think. It runs at exactly 3 c/l.
The recurrency path is extremely shallow, at 1.5 c/l. If we slightly
restrict the operand range, we could remove the indicated carry
propagation insn. Then the code runs at 2.8 c/l. Neither is decoding
bandwidth limited.

It is further possible to supplant the 'mov $0,reg' and following
'adc $0,reg' with 'setc reg'. This creates a false dependency (on the
upper 56 bits) and seems to run at about the same speed.

The plain code (i.e., the code which runs at 3.0 c/l) runs at 3-epsilon
if the lea pointer update insns are removed. This is a good sign,
proving there is no magic stopping us at 3 c/l...

  > My experience of Sandybridge is that with load/store coding style, the
  > CPU typically executes 3 insn/cycle unless there is a recurrency
  > dependency stopping that.

  If we could get there, the above loop should run just below 3 c/l.

I was obviously wrong. :-(

  Hmm. Last time I looked at that was in a 32-bit context.
  There's a 32x32->64 instruction which might be useful for a 32-bit
  build, at least in theory, but as far as I can find in the manual, the
  latest ss*-extensions don't provide any wider multiplication than that.

I believe that insn is used for 32-bit builds, where it helps. Many
improvements could be made for 32-bit builds, if one cared.

(I see a new mail has arrived, will read now.)

-- 
Torbjörn

From tg at gmplib.org Thu Feb 23 22:30:51 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 22:30:51 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <864nuhp9oc.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Thu\, 23 Feb 2012 21\:38\:27 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org> <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org> <86d395kgc5.fsf@shell.gmplib.org> <868vjtijoa.fsf@shell.gmplib.org> <864nuhp9oc.fsf@shell.gmplib.org>
Message-ID: <86y5rtnsok.fsf@shell.gmplib.org>

Torbjorn Granlund writes:

  If we slightly restrict the operand range, we could remove the
  indicated carry propagation insn.

Wrong. Those carry propagation insns are needed.

-- 
Torbjörn

From bodrato at mail.dm.unipi.it Fri Feb 24 09:21:46 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 09:21:46 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86zkcar7el.fsf@shell.gmplib.org>
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org>
Message-ID: <49731.151.32.246.62.1330071706.squirrel@mail.dm.unipi.it>

On Wed, 22 February 2012 8:32 pm, Torbjorn Granlund wrote:
> bodrato at mail.dm.unipi.it writes:
>   Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I

> Please look at the generated code from TMP_ALLOC from any reasonable
> compiler. It is a sub from sp, then a copy from sp to the target
> variable. Cost: about 1 cycle.
    sal   $3, %n
    cmpl  $65535, %n
    ja    .Lunlikelybranch
    add   $30, %n
    and   $-16, %n
    sub   %n, %esp
.Lunlikelybranchreturnshere

> TMP_ALLOC_LIMBS_2 is clutter IMHO.

-- 
http://bodrato.it/

From bodrato at mail.dm.unipi.it Fri Feb 24 09:23:36 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 09:23:36 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <49731.151.32.246.62.1330071707.squirrel@mail.dm.unipi.it>
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org> <49731.151.32.246.62.1330071707.squirrel@mail.dm.unipi.it>
Message-ID: <49732.151.32.246.62.1330071816.squirrel@mail.dm.unipi.it>

Sorry, I sent the previous mail by mistake...

-- 
http://bodrato.it/

From nisse at lysator.liu.se Fri Feb 24 10:01:23 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 24 Feb 2012 10:01:23 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86pqd6r5zk.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Wed, 22 Feb 2012 21:02:55 +0100")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org>
Message-ID: 

Torbjorn Granlund writes:

> Surely a plain TMP_ALLOC adds red zones? If not, that is something we
> ought to fix.

But

  tp = TMP_ALLOC_LIMBS (2*n);
  xp = tp + n;

does not add any between T and X (intended to hold n limbs each). So if
one doesn't use TMP_ALLOC_LIMBS_2, one should instead write

  tp = TMP_ALLOC_LIMBS (n);
  xp = TMP_ALLOC_LIMBS (n);

to get red zones for this common case. Right? Maybe this is more
overhead, in the non-debug case, than using TMP_ALLOC_LIMBS_2. I'm not
sure.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
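
The red-zone point in the message above can be illustrated with a small
stand-alone C sketch. It is not GMP's debug allocator and makes no claim
about how --enable-alloca=debug works; guarded_alloc, guarded_check,
guarded_free, ZONE and GUARD are all hypothetical names and values
invented for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ZONE  16      /* bytes per red zone; arbitrary choice */
#define GUARD 0xA5    /* fill pattern; arbitrary choice */

/* Block layout: [size header, ZONE bytes][red zone][n user bytes][red zone] */
static void *
guarded_alloc (size_t n)
{
  unsigned char *base = malloc (3 * ZONE + n);
  if (base == NULL)
    return NULL;
  assert (sizeof n <= ZONE);
  memcpy (base, &n, sizeof n);                 /* remember the user size */
  memset (base + ZONE, GUARD, ZONE);           /* zone below the area */
  memset (base + 2 * ZONE + n, GUARD, ZONE);   /* zone above the area */
  return base + 2 * ZONE;
}

/* Return 1 if both red zones are intact, 0 if either was overwritten.  */
static int
guarded_check (const void *ptr)
{
  const unsigned char *base = (const unsigned char *) ptr - 2 * ZONE;
  size_t n, i;

  memcpy (&n, base, sizeof n);
  for (i = 0; i < ZONE; i++)
    if (base[ZONE + i] != GUARD || base[2 * ZONE + n + i] != GUARD)
      return 0;
  return 1;
}

static void
guarded_free (void *ptr)
{
  free ((unsigned char *) ptr - 2 * ZONE);
}
```

With such an allocator, a single block of 2*n limbs split by hand as
tp and xp = tp + n has zones only at the outer ends, so a write to tp[n]
lands in X and passes the check. Two separate blocks of n limbs each put
a zone directly above T, and the same stray write is caught.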
From tg at gmplib.org Fri Feb 24 10:11:33 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 10:11:33 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: ("Niels Möller"'s message of "Fri\, 24 Feb 2012 10\:01\:23 +0100")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org>
Message-ID: <861upkioje.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund writes:

  > Surely a plain TMP_ALLOC adds red zones? If not, that is something we
  > ought to fix.

  But

    tp = TMP_ALLOC_LIMBS (2*n);
    xp = tp + n;

  does not add any between T and X (intended to hold n limbs each). So if
  one doesn't use TMP_ALLOC_LIMBS_2, one should instead write

    tp = TMP_ALLOC_LIMBS (n);
    xp = TMP_ALLOC_LIMBS (n);

  to get red zones for this common case. Right? Maybe this is more
  overhead, in the non-debug case, than using TMP_ALLOC_LIMBS_2. I'm not
  sure.

TMP_ALLOC_LIMBS_2 makes two TMP_ALLOC_LIMBS calls if WANT_TMP_DEBUG,
else one. So the red zones will be there when we want them.

I think the conclusion is that TMP_ALLOC_LIMBS_2 could save some cycles
by collapsing two malloc calls into one, when allocating large blocks
using TMP_ALLOC (as opposed to TMP_BALLOC or TMP_SALLOC).

My idea is that these cycles saved are unimportant, following GMP's
founding principle of relative overhead: Adding a million cycles to a
billion cycle computation does not matter, but adding 1 cycle to a 10
cycle computation is unforgivable.
-- 
Torbjörn

From nisse at lysator.liu.se Fri Feb 24 10:27:17 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 24 Feb 2012 10:27:17 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <861upkioje.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Fri, 24 Feb 2012 10:11:33 +0100")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org> <861upkioje.fsf@shell.gmplib.org>
Message-ID: 

Torbjorn Granlund writes:

> I think the conclusion is that TMP_ALLOC_LIMBS_2 could save some cycles
> by collapsing two malloc calls into one, when allocating large blocks
> using TMP_ALLOC (as opposed to TMP_BALLOC or TMP_SALLOC).

What about the test in

  #define TMP_ALLOC(n) \
    (LIKELY ((n) < 65536) ? TMP_SALLOC(n) : TMP_BALLOC(n))

That test will cost a cycle or two for each TMP_ALLOC call (with
non-constant n), regardless of size, won't it?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Fri Feb 24 10:40:00 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 10:40:00 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: ("Niels Möller"'s message of "Fri\, 24 Feb 2012 10\:27\:17 +0100")
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org> <861upkioje.fsf@shell.gmplib.org>
Message-ID: <86r4xkh8nj.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  What about the test in

    #define TMP_ALLOC(n) \
      (LIKELY ((n) < 65536) ? TMP_SALLOC(n) : TMP_BALLOC(n))

  That test will cost a cycle or two for each TMP_ALLOC call (with
  non-constant n), regardless of size, won't it?

I think my previous statement "1 cycle" should be amended to "2 cycles".
A correctly predicted compare-and-branch costs 1-2 cycles, with a
throughput of 1 per cycle (on any modern machine). The allocation code
will run in parallel with the branch (assuming again correct
prediction).

I cannot see how TMP_ALLOC_LIMBS_2 could save *anything* for small
allocations, since it basically performs the same operations. I.e., the
net cost of splitting TMP_ALLOC_LIMBS_2 into two TMP_ALLOC_LIMBS is 0.
But it might be +-1 depending on alignment and all sorts of magic.

-- 
Torbjörn

From tg at gmplib.org Fri Feb 24 10:55:47 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 10:55:47 +0100
Subject: Division call in mpn_gcd
Message-ID: <86ipiwh7x8.fsf@shell.gmplib.org>

Inspired by Marc{,o}'s cleanup commits, I decided to look for TMP_ALLOC*
calls that should be made into TMP_SALLOC* or TMP_BALLOC*.

Then I spotted this unrelated thing:

  tp = TMP_ALLOC_LIMBS(talloc);

  if (usize > n)
    {
      mpn_tdiv_qr (tp, up, 0, up, usize, vp, n);

      if (mpn_zero_p (up, n))
        {
          MPN_COPY (gp, vp, n);
          ctx.gn = n;
          goto done;
        }
    }

Why is mpn_tdiv_qr used here, when the quotient should be irrelevant?
I'd say to use mpn_bdiv_qr instead, to streamline things (followed by a
right shift to get rid of low zeros)?

After all, if g=gcd(a,b) then g | a and g | b, and g | (a + b*c) for
any c in Z.
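
The divisibility argument can be tried out on machine words. The
following is a minimal C sketch with invented names (gcd_u64,
gcd_reduce_first); it uses ordinary truncating division on uint64_t,
so it shows only that the quotient is never needed, not the 2-adic
mpn_bdiv_qr variant suggested above.

```c
#include <assert.h>
#include <stdint.h>

/* Plain Euclidean gcd on machine words.  */
static uint64_t
gcd_u64 (uint64_t a, uint64_t b)
{
  while (b != 0)
    {
      uint64_t t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Same shape as the snippet above: when u is larger than v, first
   replace u by u mod v.  Only the remainder is kept; the quotient is
   discarded, because gcd (u - q*v, v) = gcd (u, v) for every integer q.  */
static uint64_t
gcd_reduce_first (uint64_t u, uint64_t v)
{
  assert (v != 0);
  if (u >= v)
    {
      u %= v;
      if (u == 0)
        return v;      /* v divides u, so the gcd is v itself */
    }
  return gcd_u64 (u, v);
}
```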
-- 
Torbjörn

From bodrato at mail.dm.unipi.it Fri Feb 24 11:08:37 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 11:08:37 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86r4xkh8nj.fsf@shell.gmplib.org>
References: <86d396soc2.fsf@shell.gmplib.org> <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it> <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org> <861upkioje.fsf@shell.gmplib.org> <86r4xkh8nj.fsf@shell.gmplib.org>
Message-ID: <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>

Ciao,

On Fri, 24 February 2012 10:40 am, Torbjorn Granlund wrote:
> I cannot see how TMP_ALLOC_LIMBS_2 could save *anything* for small

I'm not sure I agree with Torbjorn. Nevertheless developers' time is a
far more precious resource than a few cpu cycles or bytes of code
size...

That's why I completely change my question, still about allocations.

I always used the MPZ_REALLOC macro, to enlarge (if needed) the memory
area available for an integer. This macro gives a (possibly new) pointer
with the requested size available... but it also copies the content.

Sometimes I know in advance that the content can be discarded. Is there
a standard way to ensure a given size without moving data?
Regards,
m

--
http://bodrato.it/papers/

From tg at gmplib.org Fri Feb 24 11:32:37 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 11:32:37 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
 (bodrato@mail.dm.unipi.it's message of "Fri\, 24 Feb 2012 11\:08\:37 +0100 \(CET\)")
References: <86d396soc2.fsf@shell.gmplib.org>
 <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
 <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org>
 <861upkioje.fsf@shell.gmplib.org> <86r4xkh8nj.fsf@shell.gmplib.org>
 <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
Message-ID: <86ehtkh67u.fsf@shell.gmplib.org>

bodrato at mail.dm.unipi.it writes:

  Ciao,

  On Fri, 24 February 2012 10:40 am, Torbjorn Granlund wrote:
  > I cannot see how TMP_ALLOC_LIMBS_2 could save *anything* for small

  I'm not sure I agree with Torbjorn. Nevertheless, developers' time is a
  far more precious resource than a few CPU cycles or bytes of code size...

  So let me change my question completely, though it is still about
  allocations.

  I always used the MPZ_REALLOC macro to enlarge (if needed) the memory
  area available for an integer. This macro gives a (possibly new) pointer
  with the requested size available... but it also copies the content.

  Sometimes I know in advance that the content can be discarded. Is there
  a standard way to ensure a given size without moving data?

I use this trick for that:

  rp = realloc (rp, 1);
  rp = realloc (rp, newsize);

I suspect there is a lot to win from using such a trick, at least if the
old size was large.  But perhaps some malloc implementations are so slow
at finding a 1-byte block that it can also become slower?

(It is a pity the C library provides only very primitive allocation
functions.)

I am not sure how to deal with this in GMP.  We could add a flag field
in MPZ_REALLOC, or have special functions+macros.
Unfortunately, we are somewhat constrained by the replaceable allocation
functions; changing the __gmp_reallocate_func type will break
compatibility with user code.  We will therefore need to make two
realloc calls by invoking the fairly high-level __gmp_reallocate_func
twice.  Oh well.

--
Torbjörn

From bodrato at mail.dm.unipi.it Fri Feb 24 11:41:10 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 11:41:10 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86ehtkh67u.fsf@shell.gmplib.org>
References: <86d396soc2.fsf@shell.gmplib.org>
 <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
 <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org>
 <861upkioje.fsf@shell.gmplib.org> <86r4xkh8nj.fsf@shell.gmplib.org>
 <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
 <86ehtkh67u.fsf@shell.gmplib.org>
Message-ID: <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>

Ciao,

On Fri, 24 February 2012 11:32 am, Torbjorn Granlund wrote:
> bodrato at mail.dm.unipi.it writes:
>   I always used the MPZ_REALLOC macro, to enlarge (if needed) the memory
>   area available for an integer. This macro gives a (possibly new) pointer
>   with the requested size available... but it also copies the content.
>
>   Sometimes I know in advance that the content can be discarded. Is there
>   a standard way to ensure a given size without moving data?
>
> I use this trick for that:
>
>   rp = realloc (rp, 1);
>   rp = realloc (rp, newsize);

Inspired by mpz/mul.c... Maybe we can write a macro based on:

  (*__gmp_free_func) (PTR(x), ALLOC (x) * BYTES_PER_MP_LIMB);
  ALLOC (x) = newsize;
  PTR(x) = (mp_ptr) (*__gmp_allocate_func) (newsize * BYTES_PER_MP_LIMB);

?
Regards,
m

--
http://bodrato.it/

From nisse at lysator.liu.se Fri Feb 24 11:44:39 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 11:44:39 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <86ipiwh7x8.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
 "Fri, 24 Feb 2012 10:55:47 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

>   if (usize > n)
>     {
>       mpn_tdiv_qr (tp, up, 0, up, usize, vp, n);

> Why is mpn_tdiv_qr used here, the quotient should be irrelevant?

You're right that the quotient is not wanted; mpn_div_r would make more
sense, but that function doesn't exist.

> I'd say to use mpn_bdiv_qr instead, to streamline things (followed by
> a right shift to get rid of low zeros)?

We don't require that v is odd (maybe it was a mistake to drop that
requirement?). So to use any bdiv functions, we'd first have to deal with
powers of two upfront. And the quotient is still not needed, so I think
one would want to use bdiv_r, aka redc.

So I think the right method is:

1. In case both u and v are even, do needed book-keeping for the power
   of two in the gcd.

2. Drop trailing zeros of v.

3. Reduce the size of u using redc.

Unlike the use in powm, there's no amortization of the inverse
computation, so we may need new thresholds.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
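
[Editorial sketch, not part of the thread.] The three steps listed in the
message above can be illustrated on a toy scale. The code below is only a
minimal sketch under stated assumptions: u has two 64-bit limbs, v has one,
and every name in it (mul_hi, inv_mod_2_64, ctz64, gcd_2by1) is hypothetical
and does not exist in GMP. It uses a plain one-limb Hensel (2-adic) division
step in place of redc or mpn_bdiv_qr, and finishes with ordinary Euclid.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b, built from 32-bit halves. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    uint64_t a0 = a & 0xffffffffu, a1 = a >> 32;
    uint64_t b0 = b & 0xffffffffu, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Inverse of an odd v modulo 2^64, by Newton iteration. */
static uint64_t inv_mod_2_64(uint64_t v)
{
    uint64_t x = v;               /* correct to 3 bits: v*v == 1 (mod 8) */
    for (int i = 0; i < 5; i++)
        x *= 2 - v * x;           /* each step doubles the correct bits */
    return x;
}

/* Number of trailing zero bits of a nonzero x. */
static unsigned ctz64(uint64_t x)
{
    unsigned n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* gcd of u = u1*2^64 + u0 and v, assuming u != 0 and v != 0. */
static uint64_t gcd_2by1(uint64_t u1, uint64_t u0, uint64_t v)
{
    assert(v != 0 && (u1 != 0 || u0 != 0));

    /* Step 1: book-keeping for the common power of two. */
    unsigned vcnt = ctz64(v);
    unsigned ucnt = u0 != 0 ? ctz64(u0) : 64 + ctz64(u1);
    unsigned shift = ucnt < vcnt ? ucnt : vcnt;

    /* Step 2: drop the trailing zeros of v, so that v is odd. */
    v >>= vcnt;

    /* Step 3: Hensel division.  Choose q with q*v == u (mod 2^64); then
       u - q*v is an exact multiple of 2^64, and since v is odd,
       gcd(u, v) = gcd(|u - q*v| / 2^64, v).  The quotient q is discarded. */
    uint64_t q = u0 * inv_mod_2_64(v);
    uint64_t h = mul_hi(q, v);    /* low limb of q*v equals u0 by construction */
    uint64_t r = u1 >= h ? u1 - h : h - u1;

    /* Finish with plain Euclid on single limbs. */
    while (r != 0) {
        uint64_t t = v % r;
        v = r;
        r = t;
    }
    return v << shift;
}
```

The point of the sketch is that the reduced value r fits in one limb and no
storage is needed for a full quotient, which is the property asked for in the
thread; how this maps onto multi-limb redc or bdiv code is left open there.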

From tg at gmplib.org Fri Feb 24 11:47:43 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 11:47:43 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
 (bodrato@mail.dm.unipi.it's message of "Fri\, 24 Feb 2012 11\:41\:10 +0100 \(CET\)")
References: <86d396soc2.fsf@shell.gmplib.org>
 <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
 <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org>
 <861upkioje.fsf@shell.gmplib.org> <86r4xkh8nj.fsf@shell.gmplib.org>
 <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
 <86ehtkh67u.fsf@shell.gmplib.org>
 <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
Message-ID: <868vjsh5io.fsf@shell.gmplib.org>

bodrato at mail.dm.unipi.it writes:

  Inspired by mpz/mul.c... Maybe we can write a macro based on:

    (*__gmp_free_func) (PTR(x), ALLOC (x) * BYTES_PER_MP_LIMB);
    ALLOC (x) = newsize;
    PTR(x) = (mp_ptr) (*__gmp_allocate_func) (newsize * BYTES_PER_MP_LIMB);

That's another possibility.

--
Torbjörn

From nisse at lysator.liu.se Fri Feb 24 12:26:20 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 12:26:20 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
 (bodrato@mail.dm.unipi.it's message of "Fri, 24 Feb 2012 11:41:10 +0100 (CET)")
References: <86d396soc2.fsf@shell.gmplib.org>
 <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
 <86zkcar7el.fsf@shell.gmplib.org> <86pqd6r5zk.fsf@shell.gmplib.org>
 <861upkioje.fsf@shell.gmplib.org> <86r4xkh8nj.fsf@shell.gmplib.org>
 <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
 <86ehtkh67u.fsf@shell.gmplib.org>
 <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
Message-ID:

bodrato at mail.dm.unipi.it writes:

> Inspired by mpz/mul.c...
> Maybe we can write a macro based on:
>
>   (*__gmp_free_func) (PTR(x), ALLOC (x) * BYTES_PER_MP_LIMB);
>   ALLOC (x) = newsize;
>   PTR(x) = (mp_ptr) (*__gmp_allocate_func) (newsize * BYTES_PER_MP_LIMB);

That's going to be more expensive in the case that the allocator could
grow the block in place. I imagine that's a likely case when the new
size is just slightly larger than the old size.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Fri Feb 24 13:27:36 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 13:27:36 +0100
Subject: Division call in mpn_gcd
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 11\:44\:39 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
Message-ID: <864nugh0w7.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  You're right that the quotient is not wanted; mpn_div_r would make more
  sense, but that function doesn't exist.

Indeed it doesn't.  Nor does bdiv_r (for independent dividend and
divisor sizes).

  We don't require that v is odd (maybe it was a mistake to drop that
  requirement?). So to use any bdiv functions, we'd first have to deal with
  powers of two upfront. And the quotient is still not needed, so I think
  one would want to use bdiv_r, aka redc.

Except that redc does not accept independent sizes.  Therefore we need
to use bdiv_qr.

  So I think the right method is:

  1. In case both u and v are even, do needed book-keeping for the power
     of two in the gcd.

  2. Drop trailing zeros of v.

  3. Reduce the size of u using redc.

  Unlike the use in powm, there's no amortization of the inverse
  computation, so we may need new thresholds.

Or perhaps slightly differently:

  1. Drop trailing zeros of v, keep the count as vcnt.

  2. If u is even, drop its trailing zeros (we might be lazy about that,
     only dropping <= vcnt zeros to save time in this part and keep
     u > v (approximately u > v; I know this is not the precise mpn_gcd
     argument criterion), but I think that's sub-optimal).

  3. Now, if u < v, swap them (this can only happen if we dropped all u
     zero bits).

  4. Your point 3.

Which thresholds are you talking about?

--
Torbjörn

From nisse at lysator.liu.se Fri Feb 24 14:28:09 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 14:28:09 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <864nugh0w7.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
 "Fri, 24 Feb 2012 13:27:36 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
 <864nugh0w7.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> Except that redc does not accept independent sizes.  Therefore we need
> to use bdiv_qr.

I think it should be easy to generalize at least redc_1 and redc_2 to
accept independent sizes. Might require some more work for redc_n, which
would need some kind of block-wise processing.

>   4. Your point 3.  Which thresholds are you talking about?

REDC_1_TO_REDC_2_THRESHOLD
REDC_2_TO_REDC_N_THRESHOLD

Using bdiv_qr is surely an improvement, but I think we really ought to
have some division function which doesn't require storage for the
unwanted quotient. Should that function be bdiv_r, or redc, or even
div_r?

IIRC, bdiv and redc have slightly different notions on what the
"remainder" is, but I imagine either variant would be fine for the gcd
reduction.

When un >> vn, gcd(u, v) shouldn't need O(un) scratch space (btw, this
calls for an initial reduction also in mpz_gcd).

There are a couple of other divisions in the gcd code where the quotient
is similarly unwanted: The initial division in mpn_gcdext, and the
(unlikely) division in mpn_gcd_subdiv_step.
All these divisions have the property that it doesn't really matter if
the remainder is a few bits too large, so any final adjustment step can
be omitted.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Fri Feb 24 14:51:00 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 14:51:00 +0100
Subject: Division call in mpn_gcd
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 14\:28\:09 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
 <864nugh0w7.fsf@shell.gmplib.org>
Message-ID: <86y5rsfigr.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund writes:

  > Except that redc does not accept independent sizes.  Therefore we need
  > to use bdiv_qr.

  I think it should be easy to generalize at least redc_1 and redc_2 to
  accept independent sizes. Might require some more work for redc_n, which
  would need some kind of block-wise processing.

For some reason I have forgotten, we decided to stick to their current
definitions.

I suppose we'd basically just need to modify the outer loop count, and
perhaps handle carry-out from addmul_N differently.

  >   4. Your point 3.  Which thresholds are you talking about?

  REDC_1_TO_REDC_2_THRESHOLD
  REDC_2_TO_REDC_N_THRESHOLD

I'd say REDC_1_TO_REDC_2_THRESHOLD is more-or-less a plain comparison of
the speed of addmul_1 vs addmul_2, and the inversion costs.  They are
measured for a single operand size; with two operand sizes, sqrt(un*vn)
ought to be compared to REDC_1_TO_REDC_2_THRESHOLD.  (I.e., un*vn should
be compared to REDC_1_TO_REDC_2_THRESHOLD^2.)  This reasoning disregards
the constant term of the speed of an addmul_1 or addmul_2 invocation.

  Using bdiv_qr is surely an improvement, but I think we really ought to
  have some division function which doesn't require storage for the
  unwanted quotient.
  Should that function be bdiv_r, or redc, or even div_r?

I suppose the gcd functions are quite tolerant, at least for an initial
reduction like this one.

  IIRC, bdiv and redc have slightly different notions on what the
  "remainder" is, but I imagine either variant would be fine for the gcd
  reduction.

Perhaps this is the reason for keeping redc separate?

  When un >> vn, gcd(u, v) shouldn't need O(un) scratch space (btw, this
  calls for an initial reduction also in mpz_gcd).

Makes sense.

  There are a couple of other divisions in the gcd code where the quotient
  is similarly unwanted: The initial division in mpn_gcdext,

Really?  Doesn't that quotient affect the cofactors?

I don't think we need to finalise redc vs (b)div_r before we modify the
gcd code to use Hensel division; if tdiv_qr is good enough now, bdiv_qr
ought to be good enough to be worthwhile as a separate improvement.
Sure, it is cleaner to keep the quotient out-of-the-way (and it will
very slightly simplify the scratch space allocations).

--
Torbjörn

From nisse at lysator.liu.se Fri Feb 24 14:55:36 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 14:55:36 +0100
Subject: mpn_mul_2c
Message-ID:

Here's a patch adding a new function mpn_mul_2c. Like mpn_mul_2, but
accepting a single-limb input carry. I'd like to have it (and also
mpn_addmul_2c) for generating diagonal terms in sqr_basecase, but there
may be other uses.

In the x86_64 assembly, I was tempted to move the initial multiplication
earlier, but when I tried I made mpn_mul_2 run a cycle slower (problem
is that n_param is in %rdx which collides with the multiplication).
Instead I had to duplicate the code for selecting the loop entrypoint,
and leave the old mul_2 code path unchanged.

Added support in devel/try.c, but there are no other testcases.

Comments appreciated.

Regards,
/Niels

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_2c.patch
Type: text/x-patch
Size: 8330 bytes
Desc: not available
URL:
-------------- next part --------------

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se Fri Feb 24 15:10:38 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 15:10:38 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <86y5rsfigr.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
 "Fri, 24 Feb 2012 14:51:00 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
 <864nugh0w7.fsf@shell.gmplib.org> <86y5rsfigr.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   IIRC, bdiv and redc have slightly different notions on what the
>   "remainder" is, but I imagine either variant would be fine for the gcd
>   reduction.
>
> Perhaps this is the reason for keeping redc separate?

IIRC, bdiv functions return a borrow, meaning that the remainder
corresponding to the computed quotient is negative, while redc returns a
carry, which means that the computed remainder is a bit too large.

And then the question was whether a remainder-only function should follow
the redc convention, since that's the most important use, or the bdiv_qr
convention, for consistency.

>   There are a couple of other divisions in the gcd code where the quotient
>   is similarly unwanted: The initial division in mpn_gcdext,
>
> Really?  Doesn't that quotient affect the cofactors?

It affects one of the cofactors: the one which we're not going to
return.

> If tdiv_qr is good enough now, bdiv_qr ought to be good enough to be
> worth as a separate improvement.

I agree, but I can't promise I'll get around to doing that soon.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Sat Feb 25 14:00:54 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sat, 25 Feb 2012 14:00:54 +0100
Subject: Division call in mpn_gcd
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 15\:10\:38 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
 <864nugh0w7.fsf@shell.gmplib.org> <86y5rsfigr.fsf@shell.gmplib.org>
Message-ID: <86obsnkqyh.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  > Perhaps this is the reason for keeping redc separate?

  IIRC, bdiv functions return a borrow, meaning that the remainder
  corresponding to the computed quotient is negative, while redc returns a
  carry, which means that the computed remainder is a bit too large.

That redc behaviour is just one week old...

  And then the question was whether a remainder-only function should
  follow the redc convention, since that's the most important use, or the
  bdiv_qr convention, for consistency.

And we shouldn't sacrifice speed for consistency, at the lowest mpn
level.

  > Really?  Doesn't that quotient affect the cofactors?

  It affects one of the cofactors: the one which we're not going to
  return.

I see.  I suppose that means the caller that really wants the cofactor
should perform this initial (Hensel) division, for efficiency.

--
Torbjörn

From tg at gmplib.org Sat Feb 25 23:03:35 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sat, 25 Feb 2012 23:03:35 +0100
Subject: mpn_mul_2c
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 14\:55\:36 +0100")
References:
Message-ID: <86y5rqfu4o.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Here's a patch adding a new function mpn_mul_2c. Like mpn_mul_2, but
  accepting a single-limb input carry. I'd like to have it (and also
  mpn_addmul_2c) for generating diagonal terms in sqr_basecase, but there
  may be other uses.

Are you rewriting x86_64 sqr_basecase with calls to mul_2?
If that's faster than the present code, then I think a version with
these mul_2c inlined will be even better.

Or is this experimental stuff?  In that case, are there reasons to
expect an x86_64 mul_2c to be actually used?  What for?

  In the x86_64 assembly, I was tempted to move the initial multiplication
  earlier, but when I tried I made mpn_mul_2 run a cycle slower (problem
  is that n_param is in %rdx which collides with the multiplication).
  Instead I had to duplicate the code for selecting the loop entrypoint,
  and leave the old mul_2 code path unchanged.

That's life.  I too have done that a few times.

  Added support in devel/try.c, but there are no other testcases.

  Comments appreciated.

Looks OK, except if the x86_64 asm mul_2c will never be used, I think
that change is somewhat questionable, and could be kept local.

--
Torbjörn

From nisse at lysator.liu.se Sun Feb 26 09:50:37 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Sun, 26 Feb 2012 09:50:37 +0100
Subject: mpn_mul_2c
In-Reply-To: <86y5rqfu4o.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
 "Sat, 25 Feb 2012 23:03:35 +0100")
References: <86y5rqfu4o.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> Are you rewriting x86_64 sqr_basecase with calls to mul_2?  If that's
> faster than the present code, then I think a version with these mul_2c
> inlined will be even better.

You're right, it's somewhat experimental. About sqr_basecase, I have two
working versions. The addmul_1c based version uses this loop, which
handles one diagonal term and the off-diagonal terms:

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);
      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
                               (ul << 1) + c0, p1);
    }

It can't compete with current sqr_basecase on x86_64, since the latter
uses addmul_2. It might be useful on other architectures.
But it kind-of begs for assembler implementation, to reduce the linear
work outside of the addmul_1c call (or maybe it would be good enough
with just a special addmul_1 entrypoint).

Then I have a version based on addmul_2c (which also uses mul_2c for the
first round). It uses this loop for the diagonal terms,

  for (; i < n-2; i += 2)
    {
      umul_ppmm (p1, p0, up[i], up[i+1]);
      add_ssaaaa (p1, rp[2*i+1], p1, rp[2*i+1], 0, p0);
      rp[n + i + 1] = mpn_addmul_2c (rp + 2*i + 2, up + i + 2, n - i - 2,
                                     up + i, p1);
    }

The iteration adds u_i * u_{i+1} + B * * with the larger term computed
by addmul_2c. This version is some 10% slower than current sqr_basecase
x86_64. Anyway, I think this organization is promising and simpler than
the current addmul_2 loop in sqr_basecase.c.

This does *not* shift the offdiagonal terms on the fly; I think that
would be a bit too cumbersome in C, maybe it would make sense in
assembler. So one needs an mpn_sqr_diag_addlsh1 as well. I'm considering
doing that (preferably in assembler) with the following iteration

          +-------+
          |u  *  u|
      +---+---+---+
2*        | r1| r0|
          +---+---+
+         | h1| h0|
--+-------+---+---+
  |h1'|h0'| r1| r0|
  +---+---+---+---+

Then one can compute u * u + 2*r1 without carry propagating further. But
unfortunately we do need the high carry h1, since we can get the maximum
value h1 = 1 and h0 = 0.

Or do you think it would be better to use an organization like in the
general addlsh1_n, computing as many of the diagonal products as one can
fit in registers, and then doing a carry propagation add of some 4, 6
limbs or even eight limbs at a time?

> Looks OK, except if the x86_64 asm mul_2c will never be used, I think
> that change is somewhat questionable, and could be kept local.

Well, I can keep it local for the time being. I also have a patch with
an addmul_2c entrypoint now.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
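
[Editorial sketch, not part of the thread.] For readers unfamiliar with the
structure being discussed, here is a minimal portable sketch of the plain
"off-diagonal products first, then double and add the diagonal squares"
organization of a basecase squaring. It is an illustration under stated
assumptions, not GMP's code: it uses 32-bit limbs so that a 64-bit type can
hold a full product, and the names limb_t, addmul_1 and sqr_basecase are
local to the sketch (modeled loosely on mpn_addmul_1 and on the role of
mpn_sqr_diag_addlsh1 mentioned above). It does not use the addmul_1c or
addmul_2c variants from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* small limbs: a 64-bit type holds a full product */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* rp[0..2n-1] = up[0..n-1]^2, for n >= 1; rp must not overlap up. */
static void sqr_basecase(limb_t *rp, const limb_t *up, size_t n)
{
    assert(n >= 1);
    for (size_t i = 0; i < 2 * n; i++)
        rp[i] = 0;

    /* Off-diagonal triangle: sum over i < j of up[i]*up[j]*B^(i+j). */
    for (size_t i = 0; i + 1 < n; i++)
        rp[n + i] = addmul_1(rp + 2 * i + 1, up + i + 1, n - i - 1, up[i]);

    /* Double the triangle and add the diagonal squares up[i]^2 * B^(2i),
       two result limbs per iteration, with a small carry between them. */
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)up[i] * up[i];
        uint64_t t = 2 * (uint64_t)rp[2 * i] + (sq & 0xffffffffu) + carry;
        rp[2 * i] = (limb_t)t;
        t = 2 * (uint64_t)rp[2 * i + 1] + (sq >> 32) + (t >> 32);
        rp[2 * i + 1] = (limb_t)t;
        carry = t >> 32;
    }
    assert(carry == 0);  /* a square of n limbs always fits in 2n limbs */
}
```

The second loop corresponds to the diagram: each step combines one diagonal
square, twice two off-diagonal limbs, and an incoming carry, and passes a
carry on to the next pair of limbs.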

From bodrato at mail.dm.unipi.it Sun Feb 26 17:38:57 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Sun, 26 Feb 2012 17:38:57 +0100 (CET)
Subject: Double factorial and primorial
In-Reply-To: <52976.151.32.175.169.1324316867.squirrel@mail.dm.unipi.it>
References: <49296.151.32.244.246.1323285639.squirrel@mail.dm.unipi.it>
 <86mxb0rxgw.fsf@shell.gmplib.org>
 <49490.151.32.245.79.1324157913.squirrel@mail.dm.unipi.it>
 <49400.151.32.167.109.1324279755.squirrel@mail.dm.unipi.it>
 <52976.151.32.175.169.1324316867.squirrel@mail.dm.unipi.it>
Message-ID: <49416.151.32.164.233.1330274337.squirrel@mail.dm.unipi.it>

Ciao

On Mon, 19 December 2011 6:47 pm, bodrato at mail.dm.unipi.it wrote:
> On Mon, 19 December 2011 10:14 am, Niels Möller wrote:
>> Since the established name is "double factorial", I think one should
>> use some reasonable abbreviation of that term. _dblfac_ seems good
>> enough to me, if _double_fac_ is too long.

I changed the name one more time: it is mpz_2fac_ui, now.
It's in the repo: http://gmplib.org:8000/gmp/rev/f2f516affc0c

Regards,
m

--
http://bodrato.it/software/combinatorics.html

From tg at gmplib.org Sun Feb 26 21:56:03 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sun, 26 Feb 2012 21:56:03 +0100
Subject: More cleaning
Message-ID: <86boolfh5o.fsf@shell.gmplib.org>

The K&R support was removed a year or two ago.  But we still have some
clutter remaining from it.  I suppose __GMP_PROTO is an example.

--
Torbjörn

From nisse at lysator.liu.se Mon Feb 27 07:36:25 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 27 Feb 2012 07:36:25 +0100
Subject: More cleaning
In-Reply-To: <86boolfh5o.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
 "Sun, 26 Feb 2012 21:56:03 +0100")
References: <86boolfh5o.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> The K&R support was removed a year or two ago.  But we still have some
> clutter remaining from it.
> I suppose __GMP_PROTO is an example.

It would be nice to get rid of that. Another old remnant is the
__GMP_TOKEN_PASTE setup.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se Mon Feb 27 10:06:53 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 27 Feb 2012 10:06:53 +0100
Subject: Problem with the mp_set_memory_functions interface
In-Reply-To: ("Niels =?iso-8859-1?Q?M=F6ller=22's?= message of "Fri, 27 Jan 2012 23:07:16 +0100")
References:
Message-ID:

Any thoughts on this problem with the custom memory allocation
functions?

nisse at lysator.liu.se (Niels Möller) writes:

> I find this part of the interface,
>
> :    The REALLOCATE_FUNCTION parameter OLD_SIZE and the FREE_FUNCTION
> : parameter SIZE are passed for convenience, but of course they can be
> : ignored if not needed by an implementation.  The default functions
> : using `malloc' and friends for instance don't use them.
>
> extremely awkward, bordering on totally broken.
[...]
> mpz_get_str, with a NULL buffer argument, is specified as allocating
> space *exactly* enough for the digits and the NUL terminator.
>
> :    If STR is `NULL', the result string is allocated using the current
> :    allocation function (*note Custom Allocation::).  The block will be
> :    `strlen(str)+1' bytes, that being exactly enough for the string and
> :    null-terminator.
>
> So if mpz_get_str does the natural thing of allocating
>
>   mpz_sizeinbase (x, base) + is_negative + 1
>
> bytes, it *must* use realloc to shrink that allocation in case it turns
> out that mpz_sizeinbase returned a value which was one off (as it is
> allowed to do). That's really, really ugly; it's an overhead that is
> caused by the memory allocation interface, and which is totally
> unnecessary for almost all users.
>
> And then, looking at GMP's mpz/get_str.c, it apparently fails to
> handle this case correctly!
[...]
> we have a bug which will manifest itself if
>
> 1. An application registers its own allocation function, and the custom
>    free function really depends on the size argument being correct.
>
> 2. The application uses mpz_get_str with NULL buffer, and deallocates it
>    according to the procedure suggested by the mpz_get_str
>    documentation.
>
> 3. The application then calls mpz_get_str with a value and a base, for
>    which mpz_sizeinbase returns a value larger than the actual number of
>    digits.

I'd really prefer not to fix this bug by adding a mostly useless realloc
call to mpz_get_str.

I think I'd suggest for the short term:

1. Send a message to gmp-announce, asking if there's anybody out there
   who relies on correct old-size arguments to the reallocate function
   and the free function. And if so, how painful it would be if that
   feature was removed.

2. Based on the above, we can hopefully deprecate the old-size argument,
   and always pass zero (which is what mini-gmp does, btw).

Longer term, I think it would make sense to replace this interface,
deleting the old-size argument altogether. At the same time, we could
rename mp_set_memory_functions to gmp_set_memory_functions. Maybe one
could also simplify it a bit by using a function pointer for realloc
only (realloc(NULL, size) and realloc(p, 0) could be used as substitutes
for malloc and free)?

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Mon Feb 27 16:03:01 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 27 Feb 2012 16:03:01 +0100
Subject: New "fastsse" assembly
Message-ID: <861upgnwt6.fsf@shell.gmplib.org>

I pushed new x86_64 assembly making use of 128-bit instructions working
on xmm registers.  While all x86_64 processors probably support the
instructions used, some have less throughput using these than when using
plain 64-bit instructions.
The idea is to include these just before "x86_64" in the mdep search
path, but I have not done that yet; I want to look for small-operands
regressions first.

The files are:

 * copyi.asm and copyd.asm
 * lshift.asm (written in cooperation with David Harvey)

The challenge when using 128-bit ops is alignment; the limbs are just 64
bits while we work with 128-bit ops, and this means operand alignment is
not necessarily better than 64 bits.  The code cannot write the first or
last limb with 128-bit operations unless these are aligned (the last
limb is aligned if either the src is unaligned and the count is odd, or
if the src is aligned and the count is even).  It is however fine to
make an *aligned* read using 128-bit ops always, even if this sometimes
means we read outside of a defined operand (although valgrind seems to
dislike that practice...).

Further development needed:

 * The lshift code is now not unrolled.  Unroll it 2x or 4x to achieve
   even better performance (note that the code already runs well on 5
   of 10 CPUs).

 * Make sure lshift does not cause slowdown for small operands.  If
   needed add basecase code to counter slowdown.

 * Consider loopmixing for individual CPUs.

 * When lshift is finished, write analogous rshift.

 * Write copyi/copyd that runs well also on core2 (see comments in
   these files; basically split loop into two, using movdqa also for
   reads).

--
Torbjörn

From tg at gmplib.org Mon Feb 27 16:21:20 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 27 Feb 2012 16:21:20 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <864nuhp9oc.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
 "Thu\, 23 Feb 2012 21\:38\:27 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
 <86sji2khqo.fsf@shell.gmplib.org> <86fwe2qbyp.fsf@shell.gmplib.org>
 <86d395kgc5.fsf@shell.gmplib.org> <868vjtijoa.fsf@shell.gmplib.org>
 <864nuhp9oc.fsf@shell.gmplib.org>
Message-ID: <86r4xgmhe7.fsf@shell.gmplib.org>

Torbjorn Granlund writes:

  carry-in lo in r14
  carry-in hi in rcx

	mov	0(up), %rax
	mul	v1
	mov	8(rp), %r8
	add	%rax, %r8
	mov	%rdx, %r9
	adc	$0, %r9
	mov	8(up), %rax
	mul	v0
	add	%rax, %r8
	adc	%rdx, %r9
	mov	$0, R32(%rbx)
	adc	$0, R32(%rbx)
	add	%r14, %r8		C 0
	adc	%rcx, %r9		C 1
	adc	$0, R32(%rbx)		C might be removed
	mov	%r8, 8(rp)

  carry-out lo in r9
  carry-out hi in rbx

I committed code using that block, see
mpn/x86_64/coreisbr/addmul_2.asm.

In the end, the code runs at about 3.2 c/l.  I have not reached 3.0 with
complete code.  I have no understanding of what limits things.

I played with convolution style code, i.e., code that multiplies and
accumulates column-wise.
It runs at 2.5 c/l, not counting the final summarisation code:

	.text
	.globl	main
main:
	push	%r12
	push	%r13
	push	%r14
	mov	$3300000000/4, %ecx
	.align	16
1:
	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r8
	adc	%rdx, %r9
	adc	$0, %r10d

	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r12
	adc	%rdx, %r13
	adc	$0, %r14d

	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r8
	adc	%rdx, %r9
	adc	$0, %r10d

	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r12
	adc	%rdx, %r13
	adc	$0, %r10d

	dec	%ecx
	jnz	1b

	pop	%r14
	pop	%r13
	pop	%r12
	ret

--
Torbjörn

From tg at gmplib.org Tue Feb 28 09:07:47 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 09:07:47 +0100
Subject: New failures related to recent developments
Message-ID: <86399vidnw.fsf@shell.gmplib.org>

Three powerpc systems report failure this morning.  I suspect they would
have reported failure already yesterday, if compilation hadn't failed
due to a missing file...

The failures seem to be for small cases, for which the code
(mpz/lucnum2_ui.c) uses a dumbmp/mini-gmp generated table.  I therefore
suspect a problem with the new bootstrap.c or mini-gmp.c.

I compared mpn/fib_table.c with a system that did not report any
failures (but this table too was generated with mini-gmp.c).  They have
the same contents.

But fib-table.h has

  #define FIB_TABLE_LIMIT         47
  #define FIB_TABLE_LUCNUM_LIMIT  47

on one of the failing systems and

  #define FIB_TABLE_LIMIT         47
  #define FIB_TABLE_LUCNUM_LIMIT  46

on a non-failing system.  (These are from 32-bit builds.  Corresponding
differences can be observed on failing 64-bit builds.)
-- 
Torbjörn

From nisse at lysator.liu.se Tue Feb 28 09:35:27 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Tue, 28 Feb 2012 09:35:27 +0100
Subject: New failures related to recent developments
In-Reply-To: <86399vidnw.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 09:07:47 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> I compared mpn/fib_table.c with a system that did not report any
> failures (but this table too was generated with mini-gmp.c). They have
> the same contents.

I compared the output of gen-fib table 32 0, in current gmp and
gmp-5.0.2. Result is identical on my machine.

> But fib-table.h has
>
> #define FIB_TABLE_LIMIT 47
> #define FIB_TABLE_LUCNUM_LIMIT 47
>
> on one of the failing systems and
>
> #define FIB_TABLE_LIMIT 47
> #define FIB_TABLE_LUCNUM_LIMIT 46

I also get

#define FIB_TABLE_LUCNUM_LIMIT 46

for both current and gmp-5.0.2. So I think we can conclude that's the
correct definition.

I'm not entirely sure I understand how gen-fib is supposed to work.
luc_limit is only assigned like this, in gen-fib.c:generate,

  if (mpz_cmp (l, limit) < 0)
    luc_limit = i-1;

Looking at mini-gmp.c:mpz_cmp, I've spotted one bug, but I think that's
unrelated since it affects negative numbers only.

diff -r e21157bb513d mini-gmp/mini-gmp.c
--- a/mini-gmp/mini-gmp.c	Mon Feb 27 14:37:02 2012 +0100
+++ b/mini-gmp/mini-gmp.c	Tue Feb 28 09:26:59 2012 +0100
@@ -1694,7 +1694,7 @@ mpz_cmp (const mpz_t a, const mpz_t b)
   else if (asize > 0)
     return mpn_cmp (a->_mp_d, b->_mp_d, asize);
   else if (asize < 0)
-    return -mpn_cmp (a->_mp_d, b->_mp_d, asize);
+    return -mpn_cmp (a->_mp_d, b->_mp_d, -asize);
   else
     return 0;
 }

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
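
The one-character mpz_cmp fix in the diff above can be illustrated with a small sketch. Everything in it (toy_int, toy_mpn_cmp, toy_cmp) is a hypothetical stand-in written for illustration, not mini-gmp's code. It assumes only the convention the diff implies: a negative size field marks a negative number, and the limb count is the absolute value of that field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an mpz_t: |size| limbs are in use and the
   sign of size is the sign of the number. */
typedef struct
{
  long size;
  const unsigned long *d;       /* least significant limb first */
} toy_int;

/* Compare two n-limb magnitudes, starting from the most significant
   limb.  The count must be non-negative; the buggy call violated
   exactly this precondition. */
static int
toy_mpn_cmp (const unsigned long *ap, const unsigned long *bp, long n)
{
  assert (n >= 0);
  while (--n >= 0)
    if (ap[n] != bp[n])
      return ap[n] > bp[n] ? 1 : -1;
  return 0;
}

/* Signed comparison, returning a value <0, 0 or >0. */
static int
toy_cmp (const toy_int *a, const toy_int *b)
{
  if (a->size != b->size)
    return a->size < b->size ? -1 : 1;
  else if (a->size > 0)
    return toy_mpn_cmp (a->d, b->d, a->size);
  else if (a->size < 0)
    /* Both negative: compare -size limbs, then flip the result,
       because the larger magnitude is the smaller number. */
    return -toy_mpn_cmp (a->d, b->d, -a->size);
  else
    return 0;
}
```

With two single-limb negative values, say -5 and -3, toy_cmp reports -5 as the smaller one. Passing the raw negative size to the limb comparison instead would trip the assertion in this sketch.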
From tg at gmplib.org Tue Feb 28 09:47:52 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 09:47:52 +0100
Subject: New failures related to recent developments
In-Reply-To: ("Niels Möller"'s message of "Tue, 28 Feb 2012 09:35:27 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
Message-ID: <86r4xffio7.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I'm not entirely sure I understand how gen-fib is supposed to work.
  luc_limit is only assigned like this, in gen-fib.c:generate,

    if (mpz_cmp (l, limit) < 0)
      luc_limit = i-1;

  Looking at mini-gmp.c:mpz_cmp, I've spotted one bug, but I think
  that's unrelated since it affects negative numbers only.

  diff -r e21157bb513d mini-gmp/mini-gmp.c
  --- a/mini-gmp/mini-gmp.c	Mon Feb 27 14:37:02 2012 +0100
  +++ b/mini-gmp/mini-gmp.c	Tue Feb 28 09:26:59 2012 +0100
  @@ -1694,7 +1694,7 @@ mpz_cmp (const mpz_t a, const mpz_t b)
     else if (asize > 0)
       return mpn_cmp (a->_mp_d, b->_mp_d, asize);
     else if (asize < 0)
  -    return -mpn_cmp (a->_mp_d, b->_mp_d, asize);
  +    return -mpn_cmp (a->_mp_d, b->_mp_d, -asize);
     else
       return 0;
   }

Since this happens on just a few systems, I don't think it is a simple
logical bug.  I would guess it is a dependency on uninitialised data.

The failing systems seem to have no working debugging environment (which
I understand).

-- 
Torbjörn

From nisse at lysator.liu.se Tue Feb 28 10:14:11 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Tue, 28 Feb 2012 10:14:11 +0100
Subject: New failures related to recent developments
In-Reply-To: <86r4xffio7.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 09:47:52 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86r4xffio7.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> The failing systems seem to have no working debugging environment (which
> I understand).
Printing out the value of limit (ought to be two limbs: 0, 1) and inputs
and output of the mpz_cmp call would be helpful, I think. Then we'll see
which operation goes wrong.

/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Tue Feb 28 10:14:22 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 10:14:22 +0100
Subject: New failures related to recent developments
In-Reply-To: ("Niels Möller"'s message of "Tue, 28 Feb 2012 09:35:27 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
Message-ID: <86mx83fhg1.fsf@shell.gmplib.org>

There is a systematic problem in mini-gmp.c when MPZ_REALLOC is called
when a destination variable is the same as some other (source or
destination) variable.

After MPZ_REALLOC, all cached pointers must be considered to be defunct.

I've spotted this error in 4 functions, but I haven't made a proper code
review.

-- 
Torbjörn

From tg at gmplib.org Tue Feb 28 10:18:30 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 10:18:30 +0100
Subject: New failures related to recent developments
In-Reply-To: <86mx83fhg1.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 10:14:22 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org>
Message-ID: <86haybfh95.fsf@shell.gmplib.org>

Torbjorn Granlund writes:

  There is a systematic problem in mini-gmp.c when MPZ_REALLOC is called
  when a destination variable is the same as some other (source or
  destination) variable.

  After MPZ_REALLOC, all cached pointers must be considered to be
  defunct.

  I've spotted this error in 4 functions, but I haven't made a proper
  code review.

This gives an idea for a testing mode allocation trick: Let the
MPZ_REALLOC macro always allocate a new block whether needed or not,
copy the data thereto, write random garbage to the old area, then free it.
This will make any defunct pointers read data that very likely will
cause an obvious miscomputation.

-- 
Torbjörn

From nisse at lysator.liu.se Tue Feb 28 10:23:40 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Tue, 28 Feb 2012 10:23:40 +0100
Subject: New failures related to recent developments
In-Reply-To: <86mx83fhg1.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 10:14:22 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> After MPZ_REALLOC, all cached pointers must be considered to be defunct.

It seems I was only paying attention to the destination pointer. I'll go
through all uses of MPZ_REALLOC.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Tue Feb 28 10:28:20 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 10:28:20 +0100
Subject: New failures related to recent developments
In-Reply-To: ("Niels Möller"'s message of "Tue, 28 Feb 2012 10:23:40 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org>
Message-ID: <86boojfgsr.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund writes:

  > After MPZ_REALLOC, all cached pointers must be considered to be defunct.

  It seems I was only paying attention to the destination pointer. I'll
  go through all uses of MPZ_REALLOC.

This is a (partial?) patch.  It seems to fix the present problem.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff
Type: application/octet-stream
Size: 1856 bytes
Desc: not available
URL:
-------------- next part --------------

-- 
Torbjörn

From nisse at lysator.liu.se Tue Feb 28 11:15:06 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Tue, 28 Feb 2012 11:15:06 +0100
Subject: New failures related to recent developments
In-Reply-To: <86boojfgsr.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 10:28:20 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org> <86boojfgsr.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> This is a (partial?) patch. It seems to fix the present problem.

> +  rp = MPZ_REALLOC (r, an + 1);
> +
> +  ap = a->_mp_d;
> +  bp = b->_mp_d;
> +
>    if (an < bn)
>      MPN_PTR_SWAP (ap, an, bp, bn);
>
>    cy = mpn_add (rp, ap, an, bp, bn);
>    rp[an] = cy;

I think this fix to mpz_abs_add is almost right, but the realloc must
use a size MAX(an, bn) + 1. Maybe it ought to be reorganized a bit,
eliminating the ap, bp pointers and the swapping. Something like

  rn = GMP_MAX (an, bn);
  rp = MPZ_REALLOC (r, rn + 1);
  if (an < bn)
    cy = mpn_add (rp, b->_mp_d, bn, a->_mp_d, an);
  else
    cy = mpn_add (rp, a->_mp_d, an, b->_mp_d, bn);

  if (cy > 0)
    rp[rn++] = cy;

Will you commit these fixes, or do you want me to do that?

I have found the same four direct MPZ_REALLOC problems when reviewing
the code: mpz_abs_add, mpz_and, mpz_ior and mpz_xor. Then I have looked
for functions which use cached pointers over a call to a function using
MPZ_REALLOC. But I didn't find any problems of that type. There are a
couple of additional pointers cached over an MPZ_REALLOC of a temporary,
but that shouldn't be a problem since the temporary never overlaps
anything else.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
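
The pointer-caching rule discussed in this thread (after a reallocation of the destination, operand pointers must be fetched again, because the destination may be the same variable as an operand) can be shown with a self-contained sketch. The names toy_int, toy_init, toy_realloc and toy_abs_add are hypothetical and the limb loop is a plain carry-propagating add written for this example; it is not the mini-gmp implementation, only an illustration of the ordering "reallocate first, read operand pointers afterwards" and of sizing the result as max(an, bn) + 1.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long limb;

/* Hypothetical magnitude-only integer: size limbs in use out of alloc. */
typedef struct
{
  long alloc;
  long size;
  limb *d;
} toy_int;

static void
toy_init (toy_int *x, limb v)
{
  x->d = malloc (sizeof (limb));
  assert (x->d != NULL);
  x->alloc = 1;
  x->d[0] = v;
  x->size = (v != 0);
}

/* Grow x to hold at least n limbs; the block may move. */
static limb *
toy_realloc (toy_int *x, long n)
{
  if (n > x->alloc)
    {
      limb *p = realloc (x->d, n * sizeof (limb));
      assert (p != NULL);
      x->d = p;
      x->alloc = n;
    }
  return x->d;
}

/* r = a + b on magnitudes.  r may be the same object as a and/or b. */
static void
toy_abs_add (toy_int *r, const toy_int *a, const toy_int *b)
{
  long an = a->size, bn = b->size;
  long rn = an > bn ? an : bn;
  long i;
  limb cy = 0;

  /* Reallocate first ... */
  limb *rp = toy_realloc (r, rn + 1);
  /* ... and only then read the operand pointers.  Had they been cached
     before the call, they could point into a freed block when r
     aliases a or b. */
  const limb *ap = a->d;
  const limb *bp = b->d;

  for (i = 0; i < rn; i++)
    {
      limb x = i < an ? ap[i] : 0;
      limb y = i < bn ? bp[i] : 0;
      limb s = x + cy;
      cy = s < cy;              /* carry out of x + cy */
      s += y;
      cy += s < y;              /* carry out of adding y */
      rp[i] = s;
    }
  if (cy > 0)
    rp[rn++] = cy;
  r->size = rn;
}
```

Calling toy_abs_add (&a, &a, &b) with a holding the largest single-limb value and b holding 1 forces the block to grow while a is both source and destination, which is the situation the thread describes.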
From tg at gmplib.org Tue Feb 28 11:21:25 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 11:21:25 +0100
Subject: New failures related to recent developments
In-Reply-To: ("Niels Möller"'s message of "Tue, 28 Feb 2012 11:15:06 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org> <86boojfgsr.fsf@shell.gmplib.org>
Message-ID: <86399vfeca.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Will you commit these fixes, or do you want me to do that?

Please take care of any fixing.  It's your baby after all.  :-)

  I have found the same four direct MPZ_REALLOC problems when reviewing
  the code: mpz_abs_add, mpz_and, mpz_ior and mpz_xor. Then I have
  looked for functions which use cached pointers over a call to a
  function using MPZ_REALLOC. But I didn't find any problems of that
  type.

These other usages all seem to be safe.  I finished a review too, since
four eyes see more than two.

(I think the mini-gmp test suite should have caught these errors.  I'd
suggest that you enhance it, and make sure these types of errors are
actually detected.)

-- 
Torbjörn

From nisse at lysator.liu.se Tue Feb 28 15:40:00 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Tue, 28 Feb 2012 15:40:00 +0100
Subject: New failures related to recent developments
In-Reply-To: <86399vfeca.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 11:21:25 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org> <86boojfgsr.fsf@shell.gmplib.org> <86399vfeca.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> Please take care of any fixing. It's your baby after all. :-)

Checked in now. Hope it's sufficient.

> (I think the mini-gmp test suite should have caught these errors. I'd
> suggest that you enhance it, and make sure these types of errors are
> actually detected.)
I've added some make rules to make it possible to run

  make check-mini-gmp

from the gmp tree. It should work both when building in the srcdir and
when using a separate build dir. Limitations: It uses
mini-gmp/tests/Makefile which requires GNU make. And built files are not
removed by make clean (I'd need to check the automake manual for where
to put that).

No changes to the actual tests, yet. There's lots of room for
improvements there.

The bug in mpz_abs_add actually triggered failures in mini-gmp's t-gcd
and t-powm. But they didn't earlier when I built mini-gmp separately; I
don't understand why.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org Tue Feb 28 19:55:19 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 19:55:19 +0100
Subject: New failures related to recent developments
In-Reply-To: ("Niels Möller"'s message of "Tue, 28 Feb 2012 15:40:00 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org> <86boojfgsr.fsf@shell.gmplib.org> <86399vfeca.fsf@shell.gmplib.org>
Message-ID: <867gz6aiug.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund writes:

  > Please take care of any fixing. It's your baby after all. :-)

  Checked in now. Hope it's sufficient.

We'll see tomorrow morning.  Note that the failures are sticky: a
passing "make check" does not supplant a previous failing one, since
that could lead us to miss a failure.  Once the new builds are done, I
will check that the previously failing seeds now produce no error, then
clear the failure to make the reporting square become green
(gmplib.org/devel/tm-date.html).

  I've added some make rules to make it possible to run

    make check-mini-gmp

  from the gmp tree. It should work both when building in the srcdir
  and when using a separate build dir.
  Limitations: It uses mini-gmp/tests/Makefile which requires GNU make.
  And built files are not removed by make clean (I'd need to check the
  automake manual for where to put that).

Please then add both a clean and a distclean target (perhaps working in
the exact same way).

  No changes to the actual tests, yet. There's lots of room for
  improvements there.

Do you test reuse at all?

  The bug in mpz_abs_add actually triggered failures in mini-gmp's
  t-gcd and t-powm. But they didn't earlier when I built mini-gmp
  separately; I don't understand why.

Different optimisation?

-- 
Torbjörn

From nisse at lysator.liu.se Tue Feb 28 23:12:04 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Tue, 28 Feb 2012 23:12:04 +0100
Subject: New failures related to recent developments
In-Reply-To: <867gz6aiug.fsf@shell.gmplib.org> (Torbjorn Granlund's message of "Tue, 28 Feb 2012 19:55:19 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org> <86boojfgsr.fsf@shell.gmplib.org> <86399vfeca.fsf@shell.gmplib.org> <867gz6aiug.fsf@shell.gmplib.org>
Message-ID:

Torbjorn Granlund writes:

> Please then add both a clean and a distclean target (perhaps working in
> the exact same way).

I'll look into it. I figured it's not urgent, since the files to be
cleaned away are not built by default, only when the check-mini-gmp
target is used explicitly.

>   No changes to the actual tests, yet. There's lots of room for
>   improvements there.
>
> Do you test reuse at all?

No. It would be good to do that systematically (like the GMP testsuite).
Possibly in combination with the MPZ_REALLOC hack you mentioned.

/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
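
The MPZ_REALLOC testing hack referred to above (always allocate a new block, copy the data, scribble over the old block, then free it) might look roughly like the following. This is only a sketch under assumptions: paranoid_realloc and its parameters are hypothetical names, it works on a bare limb array rather than on an mpz_t, and it fills the old block with a fixed 0xA5 byte pattern instead of the random garbage proposed in the thread, so that failures are reproducible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long limb;

/* Testing-mode reallocation: the block always moves, even when the old
   one would have been large enough.  old_n is the number of limbs
   allocated in the old block, used_n the number of limbs to preserve,
   new_n the requested allocation. */
static limb *
paranoid_realloc (limb *old, size_t old_n, size_t used_n, size_t new_n)
{
  limb *fresh;

  assert (used_n <= old_n && used_n <= new_n);

  fresh = malloc ((new_n > 0 ? new_n : 1) * sizeof (limb));
  assert (fresh != NULL);

  if (used_n > 0)
    memcpy (fresh, old, used_n * sizeof (limb));

  if (old != NULL)
    {
      /* Poison the old block so that any stale pointer into it yields
         visibly wrong limbs.  Note that an optimizing compiler may
         drop a store that is immediately followed by free; a real test
         build would have to prevent that. */
      memset (old, 0xA5, old_n * sizeof (limb));
      free (old);
    }
  return fresh;
}
```

Because the new block is obtained before the old one is released, the returned pointer never equals the old one, so every cached pointer becomes stale on every call. That is what makes the trick effective as a test mode.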
From tg at gmplib.org Tue Feb 28 23:19:38 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 23:19:38 +0100
Subject: New failures related to recent developments
In-Reply-To: ("Niels Möller"'s message of "Tue, 28 Feb 2012 23:12:04 +0100")
References: <86399vidnw.fsf@shell.gmplib.org> <86mx83fhg1.fsf@shell.gmplib.org> <86boojfgsr.fsf@shell.gmplib.org> <86399vfeca.fsf@shell.gmplib.org> <867gz6aiug.fsf@shell.gmplib.org>
Message-ID: <86vcmq8uth.fsf@shell.gmplib.org>

  > Please then add both a clean and a distclean target (perhaps working in
  > the exact same way).

  I'll look into it. I figured it's not urgent, since the files to be
  cleaned away are not built by default, only when the check-mini-gmp
  target is used explicitly.

I agree this is highly non-urgent.  :-)  I just mentioned the desirable
targets to have your TODO file contain the right entry, avoiding the
need to patch the patch.

  No. It would be good to do that systematically (like the GMP
  testsuite). Possibly in combination with the MPZ_REALLOC hack you
  mentioned.

It might be easiest to steal GMP's mpz/reuse.c since its table-driven
approach should make it a very quick job.

-- 
Torbjörn
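
As a rough illustration of what a table-driven reuse test means, here is a minimal sketch. It is not taken from GMP's reuse test; the type num, the operations num_add, num_sub and num_bad, and the helpers reuse_ok and first_reuse_failure are all hypothetical. The idea is to compute a reference result into a separate destination, repeat the call with the destination aliasing each operand, and compare.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { long v; } num;         /* hypothetical stand-in for mpz_t */
typedef void (*binop) (num *r, const num *a, const num *b);

static void num_add (num *r, const num *a, const num *b) { r->v = a->v + b->v; }
static void num_sub (num *r, const num *a, const num *b) { r->v = a->v - b->v; }

/* Deliberately alias-unsafe: writes r before it has finished reading b. */
static void
num_bad (num *r, const num *a, const num *b)
{
  r->v = a->v;
  r->v += b->v;
}

/* Return 1 if f gives the same result with r == a, r == b and
   r == a == b as it does with a separate destination. */
static int
reuse_ok (binop f, long x, long y)
{
  num a = { x }, b = { y }, ref, t;

  assert (f != NULL);

  f (&ref, &a, &b);             /* reference, distinct destination */
  t = a; f (&t, &t, &b);        /* destination is the first operand */
  if (t.v != ref.v)
    return 0;
  t = b; f (&t, &a, &t);        /* destination is the second operand */
  if (t.v != ref.v)
    return 0;

  f (&ref, &a, &a);             /* reference for identical operands */
  t = a; f (&t, &t, &t);        /* all three arguments are one object */
  return t.v == ref.v;
}

/* The table: adding an operation to the test is one more line here. */
static const struct { const char *name; binop f; } ops[] = {
  { "num_add", num_add },
  { "num_sub", num_sub },
};

/* Index of the first table entry that fails the reuse check, or -1. */
static int
first_reuse_failure (long x, long y)
{
  size_t i;
  for (i = 0; i < sizeof ops / sizeof ops[0]; i++)
    if (!reuse_ok (ops[i].f, x, y))
      return (int) i;
  return -1;
}
```

The appeal of the table is that coverage grows by adding entries rather than by writing new test code. In this sketch reuse_ok rejects num_bad when the destination is the second operand, which is the kind of overlap bug a reuse test is meant to catch.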