From rth at twiddle.net  Wed Feb  1 13:25:43 2012
From: rth at twiddle.net (Richard Henderson)
Date: Wed, 01 Feb 2012 23:25:43 +1100
Subject: Some arm cortex-a8 improvements
Message-ID: <4F292F47.6020306@twiddle.net>

Three patches herein.  If there's a better way to submit patches,
please advise; I've never used hg before.

The first patch gives gcc control over ctz/clz.  Particularly for
armv6t2 and later, which have rbit for use for ctz.
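
For readers unfamiliar with the rbit trick: reversing the bits of a word
turns its trailing zeros into leading zeros, so ctz(x) = clz(rbit(x)).
Below is a minimal portable sketch of that identity.  The function names
are illustrative and the bit reversal is done in software as a stand-in
for the rbit instruction; this is not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for ARM's rbit: reverse the 32 bits of x.  */
uint32_t
bit_reverse32 (uint32_t x)
{
  x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
  x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
  x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
  x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);
  return (x >> 16) | (x << 16);
}

/* Portable count of leading zeros; x must be nonzero.  */
int
clz32 (uint32_t x)
{
  int n = 0;
  assert (x != 0);
  while ((x & 0x80000000u) == 0)
    {
      x <<= 1;
      n++;
    }
  return n;
}

/* ctz expressed as clz of the bit-reversed word; x must be nonzero.  */
int
ctz32_via_rbit (uint32_t x)
{
  return clz32 (bit_reverse32 (x));
}
```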

The second patch improves multiplication a bit.  I'm still playing
with addmul_2, but this is a start for addmul_1/mul_1.  I couldn't
do better than the existing submul_1.  Unfortunately the Xscale
machines in the gcc build farm are turned off, so I can't test to
see if I've regressed on that platform.

The third patch tidies up add_n/sub_n, and provides for the carry-in
entry points.

It's a bit touchy speed testing these.  There's no cycle counter
available in userspace, and Hz is depressingly low.  So I've had
to bump the minimum iterations way way up in order to get semi-
reliable results.  Which causes the speed testing to take quite
a long time.
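
The usual workaround for a coarse clock can be sketched as follows:
keep doubling the repetition count until one batch lasts long enough
that a single clock tick is small relative to the measured interval.
This is only an illustration using the standard clock() function; the
helper names are hypothetical and it is not how GMP's speed program is
implemented.

```c
#include <assert.h>
#include <time.h>

/* Trivial workload used only to exercise the timer below.  */
volatile unsigned long sink;

void
sample_work (void)
{
  sink += 1;
}

/* Time fn() by doubling the repetition count until one batch lasts at
   least min_seconds.  Returns seconds per call and stores the final
   repetition count in *reps_out.  */
double
time_per_call (void (*fn) (void), double min_seconds,
               unsigned long *reps_out)
{
  unsigned long reps = 1;

  assert (min_seconds > 0.0);
  for (;;)
    {
      clock_t start = clock ();
      for (unsigned long i = 0; i < reps; i++)
        fn ();
      double elapsed = (double) (clock () - start) / CLOCKS_PER_SEC;
      if (elapsed >= min_seconds)
        {
          *reps_out = reps;
          return elapsed / (double) reps;
        }
      reps *= 2;
    }
}
```

The cost is the one described above: a larger minimum interval gives
steadier numbers but makes every measurement take longer.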

Feedback welcome.


r~
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: zz
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20120201/69a10b86/attachment.ksh>

From tg at gmplib.org  Fri Feb 10 14:19:05 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 10 Feb 2012 14:19:05 +0100
Subject: Mainline repo
Message-ID: <867gzuhlme.fsf@shell.gmplib.org>

I am moving the nightly testing back from the gmp-5.0 repo to the
mainline gmp repo.  I'll keep the extra barbwire switched on for now; at
a later point I will have the nightly builds create two library builds,
one with extra checks and one without.  The latter will be used for tuneup.

We made several bug fixes to gmp-5.0 that are not yet in the mainline
gmp.  Please remember to move your fixes over.  (It is probably best to
copy the change log entry into the same position, so that discrepancies
therein better reflect actual source code differences.  That helps me a
lot when making dot releases.)

-- 
Torbjörn

From nisse at lysator.liu.se  Fri Feb 10 22:19:52 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 10 Feb 2012 22:19:52 +0100
Subject: Mainline repo
In-Reply-To: <867gzuhlme.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Fri, 10 Feb 2012 14:19:05 +0100")
References: <867gzuhlme.fsf@shell.gmplib.org>
Message-ID: <nny5sa1j47.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> We made several bug fixes to gmp-5.0 that are not yet in the mainline
> gmp.  Please remember to move your fixes over.

Moved my changes over now. Summary, by commit id in the gmp-5.0 repo:

13549:5bed10c29692 Assert fix (u0 == u1 case), mpn_gcdext.

  Fix copied.
 
13548:77785806d3f1 hgcd_matrix_update_q bug.

  Fix copied, and code slightly simplified.

13547:ec2c2959dc8c t-gcd and t-hgcd test cases.

  Improvements copied. 

13545:11a901ce5242 gcdext_subdiv_step normalization fix.

  The current mpn_gcdext_hook seems to be correct.
 
13544:eab9e2a8bf48 gcdext_subdiv_step carry

  Uses different code in gcdext_lehmer.c:mpn_gcdext_hook. Related fix
  to u1n < un case.

> (It is probably best to copy the change log entry into the same
> position, so that discrepancies therein better reflect actual source
> code differences.

I'm not sure what order you'd prefer. For three of the above changes, I
could copy the ChangeLog entries almost verbatim: I just set the date to
today's date, I added a note that they originated in the 5.0 repo, and
in one case I had to change the file name since hgcd_matrix_update_q was
moved from hgcd.c to hgcd_matrix.c. The other bugfixes had to be handled
differently in the main repo.

Would you prefer to have the copied entries sorted by the original date?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Fri Feb 10 22:22:15 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 10 Feb 2012 22:22:15 +0100
Subject: Mainline repo
In-Reply-To: <nny5sa1j47.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 10 Feb 2012 22\:19\:52
	+0100")
References: <867gzuhlme.fsf@shell.gmplib.org>
	<nny5sa1j47.fsf@stalhein.lysator.liu.se>
Message-ID: <86ty2yl6yg.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Would you prefer to have the copied entries sorted by the original date?
  
Yes, please.

(We usually also add just one dated entry per hacker and day.  That
keeps the ChangeLog file size down.)

-- 
Torbjörn

From Paul.Zimmermann at loria.fr  Sun Feb 12 18:02:25 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Sun, 12 Feb 2012 18:02:25 +0100
Subject: fixed-size mpn_mul_n for small n?
Message-ID: <E1Rwcoj-0002hc-Py@merguez.loria.fr>

       Hi,

GMP currently has variable-size assembly code for mpn_mul_n on some
processors. Could it be faster to have fixed-size assembly code for
small values of n (say up to n=32)? Then mpn_mul_n() would simply be
a wrapper to those fixed-size functions, or to a variable-size code
for n>32.
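
The wrapper structure proposed here can be sketched in C as below.
This is only an illustration of the dispatch idea: the limb type,
function names and the cutoff MAX_FIXED = 4 are assumptions made for
the sketch, and the "fixed-size" routines are generated from a generic
loop with a constant trip count rather than written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* illustrative limb type, not mp_limb_t */

/* Variable-size schoolbook product: {rp, 2n} = {ap, n} * {bp, n}.
   rp must not overlap the inputs.  */
static inline void
mul_n_generic (limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
  for (size_t i = 0; i < 2 * n; i++)
    rp[i] = 0;
  for (size_t j = 0; j < n; j++)
    {
      uint64_t carry = 0;
      for (size_t i = 0; i < n; i++)
        {
          uint64_t t = (uint64_t) ap[i] * bp[j] + rp[i + j] + carry;
          rp[i + j] = (limb_t) t;
          carry = t >> 32;
        }
      rp[j + n] = (limb_t) carry;
    }
}

/* Fixed-size variants: n is a compile-time constant, so the compiler
   is free to unroll the loops completely.  */
#define DEFINE_MUL_FIXED(N)                                             \
  void mul_fixed_##N (limb_t *rp, const limb_t *ap, const limb_t *bp)   \
  {                                                                     \
    mul_n_generic (rp, ap, bp, N);                                      \
  }

DEFINE_MUL_FIXED (1)
DEFINE_MUL_FIXED (2)
DEFINE_MUL_FIXED (3)
DEFINE_MUL_FIXED (4)

typedef void (*mul_fixed_fn) (limb_t *, const limb_t *, const limb_t *);

#define MAX_FIXED 4
static const mul_fixed_fn mul_fixed_table[MAX_FIXED + 1] = {
  NULL, mul_fixed_1, mul_fixed_2, mul_fixed_3, mul_fixed_4
};

/* Wrapper: fixed-size code for small n, variable-size code otherwise.  */
void
mul_n (limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
  assert (n >= 1);
  if (n <= MAX_FIXED)
    mul_fixed_table[n] (rp, ap, bp);
  else
    mul_n_generic (rp, ap, bp, n);
}
```

Whether the indirect call through the table costs more than it saves
is exactly the kind of thing that would have to be measured.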

Paul Zimmermann

From tg at gmplib.org  Sun Feb 12 18:10:01 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sun, 12 Feb 2012 18:10:01 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <E1Rwcoj-0002hc-Py@merguez.loria.fr> (Zimmermann Paul's message
	of "Sun\, 12 Feb 2012 18\:02\:25 +0100")
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
Message-ID: <861uq0c712.fsf@shell.gmplib.org>

Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:

  GMP currently has variable-size assembly code for mpn_mul_n on some
  processors. Could it be faster to have fixed-size assembly code for
  small values of n (say up to n=32)? Then mpn_mul_n() would simply be
  a wrapper to those fixed-size functions, or to a variable-size code
  for n>32.
  
There are assembly implementations of mpn_mul_basecase for a lot of
machines.  Some of these offer special code for certain small un,vn.
This gives two benefits: (1) less overhead for small sizes and (2)
simpler general code.

Providing special code for many un,vn combinations (as separate
functions or as part of mpn_mul_basecase) quickly becomes unmanageable.
If we want to handle all sizes <= 16 (say) we'll need 136 variants.
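
The figure of 136 is the number of pairs (un, vn) with
1 <= vn <= un <= 16, i.e. 16*17/2.  A small sketch that counts them
directly (the function name is illustrative):

```c
#include <assert.h>

/* Number of (un, vn) pairs with 1 <= vn <= un <= max_size.  */
unsigned
count_size_pairs (unsigned max_size)
{
  unsigned count = 0;
  for (unsigned un = 1; un <= max_size; un++)
    for (unsigned vn = 1; vn <= un; vn++)
      count++;
  return count;
}
```

By the same count, covering all sizes up to 32 would take 528 variants,
while restricting to un = vn needs only one variant per size.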

I don't think it makes much sense providing code for just un=vn (except
that it becomes more manageable...).

-- 
Torbjörn

From Paul.Zimmermann at loria.fr  Sun Feb 12 18:44:25 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Sun, 12 Feb 2012 18:44:25 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <861uq0c712.fsf@shell.gmplib.org> (message from Torbjorn Granlund
	on Sun, 12 Feb 2012 18:10:01 +0100)
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org>
Message-ID: <E1RwdTN-0003uZ-7r@merguez.loria.fr>

       Torbjörn,

> Providing special code for many un,vn combinations (as separate
> functions or as part of mpn_mul_basecase) quickly becomes unmanageable.
> If we want to handle all sizes <= 16 (say) we'll need 136 variants.
> 
> I don't think it makes much sense providing code for just un=vn (except
> that it becomes more manageable...).

I was thinking of applications doing heavy computations in modular arithmetic,
like GMP-ECM, where we only need the case un=vn.

I am trying to optimize the modular multiplications and squarings in GMP-ECM
(where we use Montgomery's reduction). Below are some timings of low-level
functions for different limb sizes up to n=20 on an AMD Phenom(tm) II X2 B55
running at 3 GHz.

Paul

******************
Time in microseconds per call, size=1
mpn_mul_n  = 0.009601
mpn_sqr    = 0.007200
mpn_redc_1 = 0.010401
mpn_redc_2 = 0.013201
mulredc    = 0.004800
mul+redc_1 = 0.019601
mul+redc_2 = 0.022801
mul+redc3  = 0.017201
sqr+redc_1 = 0.018401
sqr+redc_2 = 0.020401
sqr+redc3  = 0.014801
mulredc1   = 0.004800
******************
Time in microseconds per call, size=2
mpn_mul_n  = 0.013601
mpn_sqr    = 0.009601
mpn_redc_1 = 0.017601
mpn_redc_2 = 0.020001
mulredc    = 0.018401
mul+redc_1 = 0.031202
mul+redc_2 = 0.032802
mul+redc3  = 0.032002
sqr+redc_1 = 0.031202
sqr+redc_2 = 0.032802
sqr+redc3  = 0.028002
mulredc1   = 0.007600
******************
Time in microseconds per call, size=3
mpn_mul_n  = 0.018001
mpn_sqr    = 0.013201
mpn_redc_1 = 0.026402
mpn_redc_2 = 0.031202
mulredc    = 0.028802
mul+redc_1 = 0.046803
mul+redc_2 = 0.050403
mul+redc3  = 0.048003
sqr+redc_1 = 0.039602
sqr+redc_2 = 0.045603
sqr+redc3  = 0.040802
mulredc1   = 0.010001
******************
Time in microseconds per call, size=4
mpn_mul_n  = 0.024002
mpn_sqr    = 0.017601
mpn_redc_1 = 0.038402
mpn_redc_2 = 0.036802
mulredc    = 0.046403
mul+redc_1 = 0.062404
mul+redc_2 = 0.062404
mul+redc3  = 0.064004
sqr+redc_1 = 0.056004
sqr+redc_2 = 0.056004
sqr+redc3  = 0.056004
mulredc1   = 0.011601
******************
Time in microseconds per call, size=5
mpn_mul_n  = 0.034002
mpn_sqr    = 0.028002
mpn_redc_1 = 0.050003
mpn_redc_2 = 0.056003
mulredc    = 0.066004
mul+redc_1 = 0.084005
mul+redc_2 = 0.088005
mul+redc3  = 0.092005
sqr+redc_1 = 0.078005
sqr+redc_2 = 0.082005
sqr+redc3  = 0.084005
mulredc1   = 0.013601
******************
Time in microseconds per call, size=6
mpn_mul_n  = 0.040802
mpn_sqr    = 0.036002
mpn_redc_1 = 0.064804
mpn_redc_2 = 0.064804
mulredc    = 0.091205
mul+redc_1 = 0.108007
mul+redc_2 = 0.108007
mul+redc3  = 0.117607
sqr+redc_1 = 0.100806
sqr+redc_2 = 0.098407
sqr+redc3  = 0.110407
mulredc1   = 0.014801
******************
Time in microseconds per call, size=7
mpn_mul_n  = 0.056004
mpn_sqr    = 0.042003
mpn_redc_1 = 0.089606
mpn_redc_2 = 0.086806
mulredc    = 0.120408
mul+redc_1 = 0.145609
mul+redc_2 = 0.142808
mul+redc3  = 0.148410
sqr+redc_1 = 0.128808
sqr+redc_2 = 0.128808
sqr+redc3  = 0.131608
mulredc1   = 0.016801
******************
Time in microseconds per call, size=8
mpn_mul_n  = 0.067204
mpn_sqr    = 0.051203
mpn_redc_1 = 0.102406
mpn_redc_2 = 0.102406
mulredc    = 0.153610
mul+redc_1 = 0.169610
mul+redc_2 = 0.163210
mul+redc3  = 0.179211
sqr+redc_1 = 0.153610
sqr+redc_2 = 0.147210
sqr+redc3  = 0.163210
mulredc1   = 0.018401
******************
Time in microseconds per call, size=9
mpn_mul_n  = 0.090005
mpn_sqr    = 0.061204
mpn_redc_1 = 0.126008
mpn_redc_2 = 0.122408
mulredc    = 0.190812
mul+redc_1 = 0.212414
mul+redc_2 = 0.219614
mul+redc3  = 0.219614
sqr+redc_1 = 0.183612
sqr+redc_2 = 0.187212
sqr+redc3  = 0.194412
mulredc1   = 0.020001
******************
Time in microseconds per call, size=10
mpn_mul_n  = 0.100006
mpn_sqr    = 0.072005
mpn_redc_1 = 0.140008
mpn_redc_2 = 0.144009
mulredc    = 0.232015
mul+redc_1 = 0.240015
mul+redc_2 = 0.248015
mul+redc3  = 0.264017
sqr+redc_1 = 0.216013
sqr+redc_2 = 0.216014
sqr+redc3  = 0.232014
mulredc1   = 0.022001
******************
Time in microseconds per call, size=11
mpn_mul_n  = 0.123208
mpn_sqr    = 0.079206
mpn_redc_1 = 0.162810
mpn_redc_2 = 0.167210
mulredc    = 0.277218
mul+redc_1 = 0.281618
mul+redc_2 = 0.303619
mul+redc3  = 0.303620
sqr+redc_1 = 0.246416
sqr+redc_2 = 0.246416
sqr+redc3  = 0.264017
mulredc1   = 0.023601
******************
Time in microseconds per call, size=12
mpn_mul_n  = 0.139210
mpn_sqr    = 0.096006
mpn_redc_1 = 0.192012
mpn_redc_2 = 0.182411
mulredc    = 0.326421
mul+redc_1 = 0.331221
mul+redc_2 = 0.321621
mul+redc3  = 0.355223
sqr+redc_1 = 0.283217
sqr+redc_2 = 0.283218
sqr+redc3  = 0.316821
mulredc1   = 0.024802
******************
Time in microseconds per call, size=13
mpn_mul_n  = 0.171611
mpn_sqr    = 0.109208
mpn_redc_1 = 0.213213
mpn_redc_2 = 0.218413
mulredc    = 0.379625
mul+redc_1 = 0.384824
mul+redc_2 = 0.390025
mul+redc3  = 0.421226
sqr+redc_1 = 0.327621
sqr+redc_2 = 0.327621
sqr+redc3  = 0.353622
mulredc1   = 0.026802
******************
Time in microseconds per call, size=14
mpn_mul_n  = 0.184813
mpn_sqr    = 0.123207
mpn_redc_1 = 0.229614
mpn_redc_2 = 0.229616
mulredc    = 0.431227
mul+redc_1 = 0.414426
mul+redc_2 = 0.420027
mul+redc3  = 0.464830
sqr+redc_1 = 0.352823
sqr+redc_2 = 0.352821
sqr+redc3  = 0.392026
mulredc1   = 0.029202
******************
Time in microseconds per call, size=15
mpn_mul_n  = 0.204014
mpn_sqr    = 0.138008
mpn_redc_1 = 0.270018
mpn_redc_2 = 0.264017
mulredc    = 0.498030
mul+redc_1 = 0.474030
mul+redc_2 = 0.492032
mul+redc3  = 0.522032
sqr+redc_1 = 0.402026
sqr+redc_2 = 0.408026
sqr+redc3  = 0.456029
mulredc1   = 0.030002
******************
Time in microseconds per call, size=16
mpn_mul_n  = 0.236814
mpn_sqr    = 0.153610
mpn_redc_1 = 0.307219
mpn_redc_2 = 0.294419
mulredc    = 0.556834
mul+redc_1 = 0.537634
mul+redc_2 = 0.524834
mul+redc3  = 0.582437
sqr+redc_1 = 0.454427
sqr+redc_2 = 0.448029
sqr+redc3  = 0.505632
mulredc1   = 0.031602
******************
Time in microseconds per call, size=17
mpn_mul_n  = 0.278819
mpn_sqr    = 0.163210
mpn_redc_1 = 0.333221
mpn_redc_2 = 0.326421
mulredc    = 0.632439
mul+redc_1 = 0.605238
mul+redc_2 = 0.612039
mul+redc3  = 0.659641
sqr+redc_1 = 0.503233
sqr+redc_2 = 0.489631
sqr+redc3  = 0.564434
mulredc1   = 0.034002
******************
Time in microseconds per call, size=18
mpn_mul_n  = 0.295218
mpn_sqr    = 0.187211
mpn_redc_1 = 0.352824
mpn_redc_2 = 0.352822
mulredc    = 0.705644
mul+redc_1 = 0.648042
mul+redc_2 = 0.640840
mul+redc3  = 0.734448
sqr+redc_1 = 0.540033
sqr+redc_2 = 0.540035
sqr+redc3  = 0.626440
mulredc1   = 0.035602
******************
Time in microseconds per call, size=19
mpn_mul_n  = 0.319221
mpn_sqr    = 0.197612
mpn_redc_1 = 0.395225
mpn_redc_2 = 0.418027
mulredc    = 0.782851
mul+redc_1 = 0.714445
mul+redc_2 = 0.729647
mul+redc3  = 0.805653
sqr+redc_1 = 0.608039
sqr+redc_2 = 0.623239
sqr+redc3  = 0.691645
mulredc1   = 0.036802
******************
Time in microseconds per call, size=20
mpn_mul_n  = 0.360022
mpn_sqr    = 0.224014
mpn_redc_1 = 0.440028
mpn_redc_2 = 0.440028
mulredc    = 0.864054
mul+redc_1 = 0.792048
mul+redc_2 = 0.792050
mul+redc3  = 0.880056
sqr+redc_1 = 0.664040
sqr+redc_2 = 0.656042
sqr+redc3  = 0.760048
mulredc1   = 0.038402

From nisse at lysator.liu.se  Sun Feb 12 20:06:20 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Sun, 12 Feb 2012 20:06:20 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <E1RwdTN-0003uZ-7r@merguez.loria.fr> (Zimmermann Paul's message
	of "Sun, 12 Feb 2012 18:44:25 +0100")
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
Message-ID: <nnpqdj27o3.fsf@stalhein.lysator.liu.se>

Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:

> I am trying to optimize the modular multiplications and squarings in GMP-ECM
> (where we use Montgomery's reduction).

For these moderate sizes, does it pay off to precompute a full inverse
for the Montgomery reduction, rather than using redc_1 or redc_2? (I
don't quite understand which lines in your benchmark data I should look
at).

Have you tried the "bidirectional" trick we discussed a while ago (IIRC
it was your idea): For size n, instead of the standard Montgomery
representation x' = B^n x (mod m), use the representation x' = B^{n/2} x (mod
m), and each time a size 2n product is to be reduced, cancel n/2
limbs from the right, and n/2 from the left? For the _1 version, it
might even make sense to make a single loop working from both ends.

And back to the original question: I guess you could try to completely
unroll the multiplication for some size of interest, and compare to the
general mpn_mul_basecase.

My understanding is that current branch predictors are fairly good at
the case of a loop which always runs for the same number of iterations,
so the loop overhead (even without unrolling) shouldn't be severe.
Totally unrolled code might have a greater potential for speedups for
squaring, mullo and mulhi, than for mul_basecase, since the former
probably are less friendly to the branch prediction.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From Paul.Zimmermann at loria.fr  Sun Feb 12 20:29:51 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Sun, 12 Feb 2012 20:29:51 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <nnpqdj27o3.fsf@stalhein.lysator.liu.se> (nisse@lysator.liu.se)
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
	<nnpqdj27o3.fsf@stalhein.lysator.liu.se>
Message-ID: <E1Rwf7P-00078u-5H@merguez.loria.fr>

       Niels,

> For these moderate sizes, does it pay off to precompute a full inverse
> for the Montgomery reduction, rather than using redc_1 or redc_2? (I
> don't quite understand which lines in your benchmark data I should look
> at).

I believe those sizes are too small. With a full inverse, we need that two
short products (one mullo and one mulhi) are faster than redc_1 or redc_2.
This does not seem to be the case for 14 limbs, where mpn_redc_n takes
0.45us, compared to 0.23us for mpn_redc_1:

Time in microseconds per call, size=14
mpn_mul_n  = 0.179211
mpn_sqr    = 0.123209
mpn_redc_1 = 0.229614
mpn_redc_2 = 0.235214
ecm_redc3  = 0.274418 # this is GMP-ECM variable-size assembly redc
mpn_redc_n = 0.448028
mulredc    = 0.431227 # this is GMP-ECM assembly combined mul+redc
mul+redc_1 = 0.420027
mul+redc_2 = 0.414426
mul+redc3  = 0.464830
mul+redc_n = 0.627240
sqr+redc_1 = 0.358423
sqr+redc_2 = 0.352823
sqr+redc3  = 0.403226
sqr+redc_n = 0.554434

Legend: "mul" means mpn_mul_n, "sqr" means mpn_sqr

> Have you tried the "bidirectional" trick we discussed a while ago (IIRC
> it was your idea): For size n, instead of the standard Montgomery
> representation x' = B^n x (mod m), use the representation x' = B^{n/2} x (mod
> m), and each time a size 2n product is to be reduced, cancel n/2
> limbs from the right, and n/2 from the left? For the _1 version, it
> might even make sense to make a single loop working from both ends.

no I didn't. But it doesn't save any computation, since to cancel n/2 limbs,
you need n/2 addmul_1 calls of size n, thus in total n addmul_1 calls of size
n. The only benefit is when using 2 threads reducing from the left and right.

> And back to the original question: I guess you could try to completely
> unroll the multiplication for some size of interest, and compare to the
> general mpn_mul_basecase.

I know nothing about assembly, this is why I asked on this list :-)

> My understanding is that current branch predictors are fairly good at
> the case of a loop which always runs for the same number of iterations,
> so the loop overhead (even without unrolling) shouldn't be severe.
> Totally unrolled code might have a greater potential for speedups for
> squaring, mullo and mulhi, than for mul_basecase, since the former
> probably are less friendly to the branch prediction.

ok. Any speedup in squaring (mpn_sqr) is also most welcome, since our best
ECM Stage 1 code performs 4 modular multiplications and 4 modular squarings
per iteration.

Paul


From nisse at lysator.liu.se  Mon Feb 13 13:12:03 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Mon, 13 Feb 2012 13:12:03 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <E1Rwf7P-00078u-5H@merguez.loria.fr> (Zimmermann Paul's message
	of "Sun, 12 Feb 2012 20:29:51 +0100")
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
	<nnpqdj27o3.fsf@stalhein.lysator.liu.se>
	<E1Rwf7P-00078u-5H@merguez.loria.fr>
Message-ID: <nnlio70w6k.fsf@stalhein.lysator.liu.se>

Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:

> I believe those sizes are too small. With a full inverse, we need that two
> short products (one mullo and one mulhi) are faster than redc_1 or
> redc_2. This does not seem to be the case for 14 limbs, where mpn_redc_n takes
> 0.45us, compared to 0.23us for mpn_redc_1:

I see.

>> And back to the original question: I guess you could try to completely
>> unroll the multiplication for some size of interest, and compare to the
>> general mpn_mul_basecase.

I think there's some potential for speed up of the linear term, which is
mostly relevant for small sizes. The addmul_1 calls can run at 3 cycles
per limb or so. But then computing the quotient involves dependent
multiplications with longer latency, so one may be able to compute the
independent left and right quotient in less time than computing two
quotients at the same end. Unclear to me if that's going to make any
difference in real code, in particular since then the left-to-right quotient
will require some kind of adjustment step.

> ok. Any speedup in squaring (mpn_sqr) is also most welcome, since our best
> ECM Stage 1 code performs 4 modular multiplications and 4 modular squarings
> per iteration.

What sizes are important?

I have started to look a little into elliptic curve cryptography, and
there the sizes are pretty small. E.g., using the standard curve over a
256-bit prime means that numbers are just four limbs on a 64-bit
machine. So in this case, I'd expect a specialized, completely unrolled
squaring function for this size could make a real difference.
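
To make the idea concrete, here is a minimal C sketch of a squaring
routine specialized to four limbs.  It computes each cross product
a[i]*a[j] (i < j) only once, doubles the sum, and then adds the
diagonal squares; all loop bounds are constants, so a compiler can
unroll them.  Assumptions: 32-bit limbs are used for portability (so
four limbs are 128 bits here, not 256), the name sqr_4 is illustrative,
and hand-written assembly would of course look quite different.

```c
#include <assert.h>
#include <stdint.h>

/* {rp, 8} = {ap, 4}^2, with 32-bit limbs.  rp must not overlap ap.  */
void
sqr_4 (uint32_t *rp, const uint32_t *ap)
{
  uint64_t t, carry;
  int i, j;

  for (i = 0; i < 8; i++)
    rp[i] = 0;

  /* Off-diagonal products a[i]*a[j] for i < j, each computed once.  */
  for (i = 0; i < 3; i++)
    {
      carry = 0;
      for (j = i + 1; j < 4; j++)
        {
          t = (uint64_t) ap[i] * ap[j] + rp[i + j] + carry;
          rp[i + j] = (uint32_t) t;
          carry = t >> 32;
        }
      rp[i + 4] = (uint32_t) carry;
    }

  /* Double the off-diagonal sum (shift left by one bit).  */
  carry = 0;
  for (i = 0; i < 8; i++)
    {
      t = ((uint64_t) rp[i] << 1) | carry;
      rp[i] = (uint32_t) t;
      carry = t >> 32;
    }

  /* Add the diagonal squares a[i]^2 at limb position 2i.  */
  carry = 0;
  for (i = 0; i < 4; i++)
    {
      uint64_t sq = (uint64_t) ap[i] * ap[i];
      t = (uint64_t) rp[2 * i] + (uint32_t) sq + carry;
      rp[2 * i] = (uint32_t) t;
      t = (uint64_t) rp[2 * i + 1] + (sq >> 32) + (t >> 32);
      rp[2 * i + 1] = (uint32_t) t;
      carry = t >> 32;
    }
  assert (carry == 0);          /* a^2 always fits in 8 limbs */
}
```

For four limbs this does 6 cross products plus 4 squares, i.e. 10 limb
multiplications instead of the 16 a general product needs.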

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Mon Feb 13 13:20:35 2012
From: nisse at lysator.liu.se (Niels Möller)
Date: Mon, 13 Feb 2012 13:20:35 +0100
Subject: toom54
Message-ID: <nnhayu2acs.fsf@stalhein.lysator.liu.se>

I noticed that toom54 is missing. And it's easy, since all the building
blocks already are in place. Patch below. Comments?

I'd expect it to have a place with toom43 and toom44 below, and toom6h
above, but I have no good guess on how large that place might be.

Also toom72 is missing, which would use the same interpolation function
as toom54 and toom63. Not sure how useful that might be.
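
For orientation, the way the patch splits the operands can be sketched
on its own: a is cut into four pieces of n limbs plus s high limbs, b
into three pieces of n limbs plus t high limbs, and the scratch request
is 9n + 3 limbs.  The formulas are the ones in the patch; the struct
and function names are illustrative.

```c
#include <assert.h>

struct toom54_split
{
  long n;                       /* size of the full pieces */
  long s;                       /* size of the high piece of a */
  long t;                       /* size of the high piece of b */
};

/* Piece sizes for the 5x4 split, using the same formula as the patch.  */
struct toom54_split
toom54_split_sizes (long an, long bn)
{
  struct toom54_split r;

  assert (an >= bn && bn >= 1);
  r.n = 1 + (4 * an >= 5 * bn ? (an - 1) / 5 : (bn - 1) / 4);
  r.s = an - 4 * r.n;
  r.t = bn - 3 * r.n;
  return r;
}

/* Scratch limbs, mirroring the patch's mpn_toom54_mul_itch: 9n + 3.  */
long
toom54_scratch (long an, long bn)
{
  return 9 * toom54_split_sizes (an, bn).n + 3;
}
```

For example, an = 31 and bn = 25 (the smallest sizes used by the test
file in the patch) give n = 7, s = 3, t = 4 and 66 scratch limbs.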

Regards,
/Niels

diff -r b856752462ac configure.in
--- a/configure.in	Mon Feb 13 12:38:05 2012 +0100
+++ b/configure.in	Mon Feb 13 12:40:42 2012 +0100
@@ -2634,7 +2634,7 @@
   hgcd2_jacobi hgcd_jacobi						   \
   mullo_n mullo_basecase						   \
   toom22_mul toom32_mul toom42_mul toom52_mul toom62_mul		   \
-  toom33_mul toom43_mul toom53_mul toom63_mul				   \
+  toom33_mul toom43_mul toom53_mul toom54_mul toom63_mul		   \
   toom44_mul								   \
   toom6h_mul toom6_sqr toom8h_mul toom8_sqr				   \
   toom_couple_handling							   \
diff -r b856752462ac gmp-impl.h
--- a/gmp-impl.h	Mon Feb 13 12:38:05 2012 +0100
+++ b/gmp-impl.h	Mon Feb 13 12:40:42 2012 +0100
@@ -4983,6 +4983,13 @@
   return 9 * n + 3;
 }
 
+static inline mp_size_t
+mpn_toom54_mul_itch (mp_size_t an, mp_size_t bn)
+{
+  mp_size_t n = 1 + (4 * an >= 5 * bn ? (an - 1) / (size_t) 5 : (bn - 1) / (size_t) 4);
+  return 9 * n + 3;
+}
+
 /* let S(n) = space required for input size n,
    then S(n) = 3 floor(n/2) + 1 + S(floor(n/2)).   */
 #define mpn_toom42_mulmid_itch(n) \
diff -r b856752462ac mpn/generic/toom54_mul.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mpn/generic/toom54_mul.c	Mon Feb 13 12:40:42 2012 +0100
@@ -0,0 +1,135 @@
+/* Implementation of the toom54 (same points as toom63)
+
+   Contributed to the GNU project by Niels Möller.
+
+   THE FUNCTION IN THIS FILE IS INTERNAL WITH A MUTABLE INTERFACE.  IT IS ONLY
+   SAFE TO REACH IT THROUGH DOCUMENTED INTERFACES.  IN FACT, IT IS ALMOST
+   GUARANTEED THAT IT WILL CHANGE OR DISAPPEAR IN A FUTURE GNU MP RELEASE.
+
+Copyright 2009, 2012 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published by
+the Free Software Foundation; either version 3 of the License, or (at your
+option) any later version.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.  */
+
+
+#include "gmp.h"
+#include "gmp-impl.h"
+
+
+/* Toom-4.5, the splitting 5x4 unbalanced version.
+   Evaluate in: infinity, +4, -4, +2, -2, +1, -1, 0.
+
+  <--s-><--n--><--n--><--n--><--n-->
+   ____ ______ ______ ______ ______ 
+  |_a4_|__a3__|__a2__|__a1__|__a0__|
+	  |b3_|__b2__|__b1__|__b0__|
+	  <-t-><--n--><--n--><--n-->
+
+*/
+#define TOOM_54_MUL_N_REC(p, a, b, n, ws)		\
+  do {	mpn_mul_n (p, a, b, n);				\
+  } while (0)
+
+#define TOOM_54_MUL_REC(p, a, na, b, nb, ws)		\
+  do {	mpn_mul (p, a, na, b, nb);			\
+  } while (0)
+
+void
+mpn_toom54_mul (mp_ptr pp,
+		mp_srcptr ap, mp_size_t an,
+		mp_srcptr bp, mp_size_t bn, mp_ptr scratch)
+{
+  mp_size_t n, s, t;
+  mp_limb_t cy;
+  int sign;
+
+  /***************************** decomposition *******************************/
+#define a4  (ap + 4 * n)
+#define b3  (bp + 3 * n)
+
+  ASSERT (an >= bn);
+  n = 1 + (4 * an >= 5 * bn ? (an - 1) / (size_t) 5 : (bn - 1) / (size_t) 4);
+
+  s = an - 4 * n;
+  t = bn - 3 * n;
+
+  ASSERT (0 < s && s <= n);
+  ASSERT (0 < t && t <= n);
+  /* WARNING! it assumes s+t>=n */
+  ASSERT ( s + t >= n );
+  ASSERT ( s + t > 4);
+  ASSERT ( n > 2);
+
+#define   r8    pp				/* 2n   */
+#define   r7    scratch				/* 3n+1 */
+#define   r5    (pp + 3*n)			/* 3n+1 */
+#define   v0    (pp + 3*n)			/* n+1 */
+#define   v1    (pp + 4*n+1)			/* n+1 */
+#define   v2    (pp + 5*n+2)			/* n+1 */
+#define   v3    (pp + 6*n+3)			/* n+1 */
+#define   r3    (scratch + 3 * n + 1)		/* 3n+1 */
+#define   r1    (pp + 7*n)			/* s+t <= 2*n */
+#define   ws    (scratch + 6 * n + 2)		/* ??? */
+
+  /* Alloc also 3n+1 limbs for ws... mpn_toom_interpolate_8pts may
+     need all of them, when DO_mpn_sublsh_n uses a scratch  */
+  /********************** evaluation and recursive calls *********************/
+  /* $\pm4$ */
+  sign  = mpn_toom_eval_pm2exp (v2, v0, 4, ap, n, s, 2, pp);
+  sign ^= mpn_toom_eval_pm2exp (v3, v1, 3, bp, n, t, 2, pp);
+  TOOM_54_MUL_N_REC(pp, v0, v1, n + 1, ws); /* A(-4)*B(-4) */
+  TOOM_54_MUL_N_REC(r3, v2, v3, n + 1, ws); /* A(+4)*B(+4) */
+  mpn_toom_couple_handling (r3, 2*n+1, pp, sign, n, 2, 4); /* FIXME: ...,2,4 ?*/
+
+  /* $\pm1$ */
+  sign  = mpn_toom_eval_pm1 (v2, v0, 4, ap, n, s,    pp);
+  sign ^= mpn_toom_eval_dgr3_pm1 (v3, v1, bp, n, t,    pp);
+  TOOM_54_MUL_N_REC(pp, v0, v1, n + 1, ws); /* A(-1)*B(-1) */
+  TOOM_54_MUL_N_REC(r7, v2, v3, n + 1, ws); /* A(1)*B(1) */
+  mpn_toom_couple_handling (r7, 2*n+1, pp, sign, n, 0, 0);
+
+  /* $\pm2$ */
+  sign  = mpn_toom_eval_pm2 (v2, v0, 4, ap, n, s, pp);
+  sign ^= mpn_toom_eval_dgr3_pm2 (v3, v1, bp, n, t, pp);
+  TOOM_54_MUL_N_REC(pp, v0, v1, n + 1, ws); /* A(-2)*B(-2) */
+  TOOM_54_MUL_N_REC(r5, v2, v3, n + 1, ws); /* A(+2)*B(+2) */
+  mpn_toom_couple_handling (r5, 2*n+1, pp, sign, n, 1, 2); /* FIXME: ...,1,2)? */
+
+  /* A(0)*B(0) */
+  TOOM_54_MUL_N_REC(pp, ap, bp, n, ws);
+
+  /* Infinity */
+  if (s > t) {
+    TOOM_54_MUL_REC(r1, a4, s, b3, t, ws);
+  } else {
+    TOOM_54_MUL_REC(r1, b3, t, a4, s, ws);
+  };
+
+  mpn_toom_interpolate_8pts (pp, n, r3, r7, s + t, ws);
+
+#undef a4
+#undef b3
+#undef r1
+#undef r3
+#undef r5
+#undef v0
+#undef v1
+#undef v2
+#undef v3
+#undef r7
+#undef r8
+#undef ws
+}
+  
diff -r b856752462ac tests/mpn/Makefile.am
--- a/tests/mpn/Makefile.am	Mon Feb 13 12:38:05 2012 +0100
+++ b/tests/mpn/Makefile.am	Mon Feb 13 12:40:42 2012 +0100
@@ -25,7 +25,7 @@
 check_PROGRAMS = t-asmtype t-aors_1 t-divrem_1 t-mod_1 t-fat t-get_d	\
   t-instrument t-iord_u t-mp_bases t-perfsqr t-scan			\
   t-toom22 t-toom32 t-toom33 t-toom42 t-toom43 t-toom44			\
-  t-toom52 t-toom53 t-toom62 t-toom63 t-toom6h t-toom8h			\
+  t-toom52 t-toom53 t-toom54 t-toom62 t-toom63 t-toom6h t-toom8h	\
   t-mul t-mullo t-mulmod_bnm1 t-sqrmod_bnm1 t-mulmid			\
   t-hgcd t-hgcd_appr t-matrix22 t-invert t-div t-bdiv
 
diff -r b856752462ac tests/mpn/t-toom54.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tests/mpn/t-toom54.c	Mon Feb 13 12:40:42 2012 +0100
@@ -0,0 +1,8 @@
+#define mpn_toomMN_mul mpn_toom54_mul
+#define mpn_toomMN_mul_itch mpn_toom54_mul_itch
+
+#define MIN_AN 31
+#define MIN_BN(an) ((3*(an) + 32) / (size_t) 5)		/* 3/5 */
+#define MAX_BN(an) ((an) - 6)	                        /* 1/1 */
+
+#include "toom-shared.h"




-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


From tg at gmplib.org  Mon Feb 13 13:25:41 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 13 Feb 2012 13:25:41 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <nnlio70w6k.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Mon\, 13 Feb 2012 13\:12\:03
	+0100")
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
	<nnpqdj27o3.fsf@stalhein.lysator.liu.se>
	<E1Rwf7P-00078u-5H@merguez.loria.fr>
	<nnlio70w6k.fsf@stalhein.lysator.liu.se>
Message-ID: <86pqdizzqy.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I have started to look a little into elliptic curve cryptography, and
  there the sizes are pretty small. E.g., using the standard curve over a
  256-bit prime means that numbers are just four limbs on a 64-bit
  machine. So in this case, I'd expect a specialized, completely unrolled
  squaring function for this size could make a real difference.
  
The x86_64 sqr_basecase has special code for n <= 4.

-- 
Torbjörn

From tg at gmplib.org  Mon Feb 13 13:39:23 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 13 Feb 2012 13:39:23 +0100
Subject: toom54
In-Reply-To: <nnhayu2acs.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Mon\, 13 Feb 2012 13\:20\:35
	+0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
Message-ID: <86lio6zz44.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I noticed that toom54 is missing. And it's easy, since all the building
  blocks already are in place. Patch below. Comments?
  
  I'd expect it to have a place with toom43 and toom44 below, and toom6h
  above, but I have no good guess on how large that place might be.
  
  Also toom72 is missing, which would use the same interpolation function
  as toom54 and toom63. Not sure how useful that might be.
  
I am afraid Marco posted both a long time ago (2009?).  They live in
shell:~tege/gmp/mpn/generic/toom{54,72}_mul.c.

You might want to merge your versions.

The diagrams at https://gmplib.org/devel/ include timing for Marco's
functions.  It seems toom54 is quite useful, toom72 less so.  (These
diagrams are from 2009, things will have changed.)

The tricky part might be making good use of them in mul.c.  (I suppose
we never checked them in because we never fixed mul.c.)

-- 
Torbjörn

From Paul.Zimmermann at loria.fr  Mon Feb 13 13:47:31 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Mon, 13 Feb 2012 13:47:31 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <nnlio70w6k.fsf@stalhein.lysator.liu.se> (nisse@lysator.liu.se)
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
	<nnpqdj27o3.fsf@stalhein.lysator.liu.se>
	<E1Rwf7P-00078u-5H@merguez.loria.fr>
	<nnlio70w6k.fsf@stalhein.lysator.liu.se>
Message-ID: <E1RwvJb-0000N4-Jk@merguez.loria.fr>

       Niels,

> I think there's some potential for speed up of the linear term, which is
> mostly relevant for small sizes. The addmul_1 calls can run at 3 cycles
> per limb or so. But then computing the quotient involves dependent
> multiplications with longer latency, so one may be able to compute the
> independent left and right quotient in less time than computing two
> quotients at the same end. Unclear to me if that's going to make any
> difference in real code, in particular since then left-to-right quotient
> will require some kind of adjustment step.

agreed. But to avoid this dependent multiplication, I believe
Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls
with N, you perform n-1 calls with k*N such that the low limb of k*N is -1
(thus the quotient selection is trivial) and one call with N. Here is a
reference C implementation (not tested), where {sp, nn} = (k*N+1)/B:

static void
ecm_redc_1_svoboda (mp_ptr rp, mp_ptr tmp, mp_srcptr np, mp_size_t nn,
                    mp_limb_t invm, mp_srcptr sp)
{
  mp_size_t j;
  mp_limb_t t0, cy;

  /* instead of adding {np, nn} * (invm * tmp[0] mod B), we add
     {sp, nn} * tmp[0], where {np, nn} * invm = B * {sp, nn} - 1 */
  for (j = 0; j < nn - 1; j++, tmp++)
    rp[j + 1] = mpn_addmul_1 (tmp + 1, sp, nn, tmp[0]);
  /* for the last step, we reduce with {np, nn} */
  t0 = mpn_addmul_1 (tmp, np, nn, tmp[0] * invm);
  tmp ++;

  rp[0] = tmp[0];
  cy = mpn_add_n (rp + 1, rp + 1, tmp + 1, nn - 1);
  rp[nn-1] += t0;
  cy += rp[nn-1] < t0;
  if (cy != 0)
    mpn_sub_n (rp, rp, np, nn); /* a borrow should always occur here */
}

Of course the same idea could be applied to redc_2.
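
To make the precomputation concrete, here is a minimal sketch under toy
assumptions: 16-bit limbs (B = 2^16) and a two-limb odd N, so that every
intermediate fits in uint64_t. The names neg_inverse_16, svoboda_s and
svoboda_step are made up for this illustration; they are not GMP or GMP-ECM
functions.

```c
#include <stdint.h>

/* Toy limb size: B = 2^16, so a two-limb N fits in 32 bits. */
#define TOY_B ((uint64_t)1 << 16)

/* k = -1/n0 mod B for odd n0, via Newton iteration (hypothetical helper). */
static uint16_t neg_inverse_16(uint16_t n0)
{
    uint32_t x = n0;                 /* n0 * n0 == 1 mod 8: 3 correct bits */
    for (int i = 0; i < 4; i++)      /* each step doubles the correct bits */
        x = x * (2u - (uint32_t)n0 * x);
    return (uint16_t)(0u - x);
}

/* s = (k*N + 1)/B; the low limb of k*N is B - 1 by the choice of k. */
static uint64_t svoboda_s(uint32_t n)
{
    uint64_t kn = (uint64_t)neg_inverse_16((uint16_t)n) * n;
    return (kn + 1) / TOY_B;
}

/* One reduction step: returns (t + t0*k*N)/B, computed as t/B + t0*s.
   The quotient limb is simply t0, so no dependent multiplication by
   the inverse is needed. */
static uint64_t svoboda_step(uint64_t t, uint64_t s)
{
    uint64_t t0 = t % TOY_B;
    return t / TOY_B + t0 * s;
}
```

The step relies on k*N = B*s - 1, so t + t0*k*N = (t - t0) + B*t0*s, which is
divisible by B. In the multi-limb routine above the same role is played by
the addmul_1 call on tmp + 1 with {sp, nn}.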

> What sizes are important?
> 
> I have started to look a little into elliptic curve cryptography, and
> there the sizes are pretty small. E.g., Using the standard curve over a
> 256-bit prime means that numbers are just four limbs on a 64-bit
> machine. So in this case, I'd expect a specialized, completely unrolled
> squaring function for this size could make a real difference.

for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200
to 400 decimal digits.

Paul

From nisse at lysator.liu.se  Mon Feb 13 14:08:47 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 13 Feb 2012 14:08:47 +0100
Subject: toom54
In-Reply-To: <86lio6zz44.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Mon, 13 Feb 2012 13:39:23 +0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
Message-ID: <nnd39i284g.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> I am afraid Marco posted both a long time ago (2009?).  They live in
> shell:~tege/gmp/mpn/generic/toom{54,72}_mul.c.

Ah. That version is virtually identical (not surprising, given that both
versions are intimately related to the same toom63_mul.c). Just some
different names for the helper functions, and ASSERT (s + t > n) vs
ASSERT (s + t >= n), related to your recent change of
mpn_toom_interpolate_8pts.

> The diagrams at https://gmplib.org/devel/ include timing for Marco's
> functions.  It seems toom54 is quite useful, toom72 less so.  (These
> diagrams are from 2009, things will have changed.)

Should I push in toom54 then? (Naturally, Marco should have the credit).

> The tricky part might be making good use of them in mul.c.  (I suppose
> we never checked them in because we never fixed mul.c.)

toom52 and toom62 are also unused. Which reminds me that I should
correct the toom63 row in the diagram in mul.c.

/nisse
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Mon Feb 13 15:07:45 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 13 Feb 2012 15:07:45 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <E1RwvJb-0000N4-Jk@merguez.loria.fr> (Zimmermann Paul's message
	of "Mon, 13 Feb 2012 13:47:31 +0100")
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
	<nnpqdj27o3.fsf@stalhein.lysator.liu.se>
	<E1Rwf7P-00078u-5H@merguez.loria.fr>
	<nnlio70w6k.fsf@stalhein.lysator.liu.se>
	<E1RwvJb-0000N4-Jk@merguez.loria.fr>
Message-ID: <nn8vk625e6.fsf@stalhein.lysator.liu.se>

Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:

> agreed. But to avoid this dependent multiplication, I believe
> Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls
> with N, you perform n-1 calls with k*N such that the low limb of k*N is -1
> (thus the quotient selection is trivial) and one call with N.

Ah, that's clever. I wasn't aware of that method.

> for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200
> to 400 decimal digits.

Using completely unrolled code for all these sizes (I'm now primarily
thinking of squaring) seems a bit impractical. Maybe one could do
something reasonable with specialcase code for collecting the
off-diagonal terms (long code with jumps into it), and then a plain loop
for the diagonal and the final shift + add.

/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From bodrato at mail.dm.unipi.it  Mon Feb 13 17:44:10 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Mon, 13 Feb 2012 17:44:10 +0100 (CET)
Subject: toom54
In-Reply-To: <nnd39i284g.fsf@stalhein.lysator.liu.se>
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
Message-ID: <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>

Ciao,

On Mon, 13 February 2012 2:08 pm, Niels Möller wrote:
> Torbjorn Granlund <tg at gmplib.org> writes:

>> shell:~tege/gmp/mpn/generic/toom{54,72}_mul.c.
>
> Ah. That version is virtually identical (not surprising, given that both
> versions are intimately related to the same toom63_mul.c). Just some

Yes, Toom-4.5 inversion is structured with the Toom'n'half strategy. It
shouldn't be difficult to write a single function working both as 54 and
63 :-)
For those people with no access to shell, my (old) code is available on my
site: http://bodrato.it/software/toom.html#TC4.5 .

>> The diagrams at https://gmplib.org/devel/ include timing for Marco's
>> functions.  It seems toom54 is quite useful, toom72 less so.  (These
>> diagrams are from 2009, things will have changed.)

Yes, things have changed, the main such change comes from the new toom6h
and toom8h functions. It would be nice to regenerate the diagrams with the
new algorithms. I guess toom72 will not cover a wide region in a current
version.

> toom52 and toom62 are also unused. Which reminds me that I should
> correct the toom63 row in the diagram in mul.c.

And that diagram reminds me that the capability of toom6h and toom8h to
handle unbalanced operands is still unused...

Regards,
m

-- 
http://bodrato.it/toom-cook/


From Paul.Zimmermann at loria.fr  Mon Feb 13 19:00:43 2012
From: Paul.Zimmermann at loria.fr (Zimmermann Paul)
Date: Mon, 13 Feb 2012 19:00:43 +0100
Subject: fixed-size mpn_mul_n for small n?
In-Reply-To: <nn8vk625e6.fsf@stalhein.lysator.liu.se> (nisse@lysator.liu.se)
References: <E1Rwcoj-0002hc-Py@merguez.loria.fr>
	<861uq0c712.fsf@shell.gmplib.org> <E1RwdTN-0003uZ-7r@merguez.loria.fr>
	<nnpqdj27o3.fsf@stalhein.lysator.liu.se>
	<E1Rwf7P-00078u-5H@merguez.loria.fr>
	<nnlio70w6k.fsf@stalhein.lysator.liu.se>
	<E1RwvJb-0000N4-Jk@merguez.loria.fr>
	<nn8vk625e6.fsf@stalhein.lysator.liu.se>
Message-ID: <E1Rx0Ch-0002qz-22@merguez.loria.fr>

       Niels,

> > agreed. But to avoid this dependent multiplication, I believe
> > Montgomery-Svoboda has more potential. Instead of performing n addmul_1 calls
> > with N, you perform n-1 calls with k*N such that the low limb of k*N is -1
> > (thus the quotient selection is trivial) and one call with N.
> 
> Ah, that's clever. I wasn't aware of that method.

the subquadratic version is described in [1], Algorithm 2.8. However I only
recently figured out that with (kN+1)/B precomputed, you only need an addmul_1
of length n (instead of n+1) at each step for the quadratic version.

> > for GMP-ECM, the most important range is from 10 to 20 limbs, i.e., about 200
> > to 400 decimal digits.

for RSA signature and encryption, 16 and 32 limbs are important targets too.

> Using completely unrolled code for all these sizes (I'm now primarily
> thinking of squaring) seems a bit impractical. Maybe one could do
> something reasonable with specialcase code for collecting the
> off-diagonal terms (long code with jumps into it), and then a plain loop
> for the diagonal and the final shift + add.

I found a variant, but I'm not sure it is better:

1) first put the diagonal terms in place (this will fill exactly the 2n buffer)
2) divide by 2 (if the input is odd, store the carry out)
3) accumulate the off-diagonal terms (could be done in assembly as you suggest)
4) multiply by 2 (and restore the carry out)
5) perform the usual reduction
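
Steps 1 to 4 can be sketched as follows. This is only an illustration under
stated assumptions: 32-bit limbs with 64-bit intermediates instead of
mp_limb_t, n >= 1, and no reduction (step 5 is left out). The names limb_t,
toy_addmul_1 and toy_sqr are made up; toy_addmul_1 merely mimics the
interface of mpn_addmul_1.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;

/* rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
static limb_t toy_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = t >> 32;
    }
    return (limb_t)cy;
}

/* rp[0..2n-1] = up[0..n-1]^2, for n >= 1. */
static void toy_sqr(limb_t *rp, const limb_t *up, size_t n)
{
    size_t i, k;

    /* 1) diagonal terms up[i]^2 fill the 2n limbs exactly */
    for (i = 0; i < n; i++) {
        uint64_t d = (uint64_t)up[i] * up[i];
        rp[2 * i] = (limb_t)d;
        rp[2 * i + 1] = (limb_t)(d >> 32);
    }

    /* 2) divide by 2, remembering the bit that is shifted out */
    limb_t low = rp[0] & 1;
    for (k = 0; k + 1 < 2 * n; k++)
        rp[k] = (rp[k] >> 1) | (rp[k + 1] << 31);
    rp[2 * n - 1] >>= 1;

    /* 3) accumulate the off-diagonal terms up[i] * up[j], i < j */
    for (i = 0; i + 1 < n; i++) {
        limb_t cy = toy_addmul_1(rp + 2 * i + 1, up + i + 1, n - i - 1, up[i]);
        /* the running sum stays below B^(2n)/2, so the carry dies out */
        for (k = n + i; cy != 0 && k < 2 * n; k++) {
            rp[k] += cy;
            cy = rp[k] < cy;
        }
    }

    /* 4) multiply by 2 and restore the saved bit */
    for (k = 2 * n - 1; k > 0; k--)
        rp[k] = (rp[k] << 1) | (rp[k - 1] >> 31);
    rp[0] = (rp[0] << 1) | low;
}
```

The halving is exact up to the saved bit because the off-diagonal part of the
square is always even, so floor(D/2) + O equals floor(u^2/2), where D is the
diagonal sum and O the off-diagonal sum.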

Paul

[1] Modern Computer Arithmetic, Richard Brent and Paul Zimmermann, Cambridge University Press, 2010, available online.

From nisse at lysator.liu.se  Mon Feb 13 21:38:26 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 13 Feb 2012 21:38:26 +0100
Subject: toom54
In-Reply-To: <49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	(bodrato@mail.dm.unipi.it's message of "Mon, 13 Feb 2012 17:44:10
	+0100 (CET)")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
Message-ID: <nn4nuu1nb1.fsf@stalhein.lysator.liu.se>

bodrato at mail.dm.unipi.it writes:

> Yes, Toom-4.5 inversion is structured with the Toom'n'half strategy. It
> shouldn't be difficult to write a single function working both as 54 and
> 63 :-)

That's a question of taste. I'd prefer separate functions, and then
leave choice of strategy to mpn_mul. But which style really makes the
most sense depends a bit on where relevant thresholds end up.

toom54 is really a small and simple function now, the interesting piece
is just 35 lines. The reason toom63 is a bit larger is that evaluation
of the degree 2 polynomial (3 coefficients) is inlined.

One could save some code size by writing some helper functions for this
case as well (shared by toom[3456]3 and possibly also toom32). Or if the
function call overhead is too expensive for toom33 and toom32,
mpn_toom_eval_dgr2_pm1 could be done as a macro rather than a function,
even if that would just reduce source code size, not object code size.

> Yes, things have changed, the main such change comes from the new toom6h
> and toom8h functions. It would be nice to regenerate the diagrams with the
> new algorithms. I guess toom72 will not cover a wide region in a current
> version.

I think it might provide some insight to do benchmarks along fixed
ratios. For each algorithm, there's a supported range. Take both end
points and the midpoint. Along each of these three lines, benchmark for
a range of sizes with a fixed ratio, and see where the algorithm beats
other algorithms which support the same ratio. There's some complication
from the unbalanced calls (which one would want to use an optimal
algorithm choice), but I hope that will have a fairly small influence on
the results.
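
As a rough sketch of how the three benchmark lines could be generated: the
struct ratio type and the helpers ratio_midpoint and un_for_vn below are
hypothetical, overflow is ignored, and any endpoints fed to them (e.g. 4/3
and 5/3) are placeholders rather than the real supported range of any toom
function.

```c
#include <stddef.h>

/* A size ratio un/vn written as a fraction num/den (hypothetical type). */
struct ratio { unsigned long num, den; };

/* Midpoint (a + b)/2 of two ratios; the result is not reduced. */
static struct ratio ratio_midpoint(struct ratio a, struct ratio b)
{
    struct ratio m = { a.num * b.den + b.num * a.den, 2 * a.den * b.den };
    return m;
}

/* Largest un with un/vn <= r, for a given vn. */
static unsigned long un_for_vn(struct ratio r, unsigned long vn)
{
    return vn * r.num / r.den;
}
```

One would then step vn over the sizes of interest and time each candidate
algorithm at (un_for_vn (r, vn), vn), once for each of the three ratios.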

It would also be interesting to prepare some graph showing for each
function which functions it may use for the recursive calls (with a
close-to-optimal algorithm choice). E.g., I suspect toom32 shouldn't
call anything but basecase, toom32 and toom22.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 14 20:09:24 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 14 Feb 2012 20:09:24 +0100
Subject: toom54
In-Reply-To: <nn4nuu1nb1.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Mon\, 13 Feb 2012 21\:38\:26
	+0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	<nn4nuu1nb1.fsf@stalhein.lysator.liu.se>
Message-ID: <86vcn9w7tn.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I think it might provide some insight to do benchmarks along fixed
  ratios. For each algorithm, there's a supported range. Take both end
  points and the midpoint. Along each of these three lines, benchmark for
  a range of sizes with a fixed ratio, and see where the algorithm beats
  other algorithms which support the same ratio. There's some complication
  from the unbalanced calls (which one would want to use an optimal
  algorithm choice), but I hope that will have a fairly small influence on
  the results.
  
  It would also be interesting to prepare some graph showing for each
  function which functions it may use for the recursive calls (with a
  close-to-optimal algorithm choice). E.g., I suspect toom32 shouldn't
  call anything but basecase, toom32 and toom22.
  
Not so sure.  But we can decide to make it that.  It currently is
fastest for large operands, in a *narrow* space between 43 and 53, and
its subproducts will want 33.

To enable 54 in today's framework, it should probably be here:

	      else if (2 * un < 3 * vn)
		{
		  if (BELOW_THRESHOLD (vn, MUL_TOOM32_TO_TOOM43_THRESHOLD))
		    mpn_toom32_mul (prodp, up, un, vp, vn, scratch);
!		  else if (BELOW_THRESHOLD (vn, MUL_TOOM43_TO_TOOM54_THRESHOLD))
		    mpn_toom43_mul (prodp, up, un, vp, vn, scratch);
!		  else
!		    mpn_toom54_mul (prodp, up, un, vp, vn, scratch);
		}

But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set to
some default.
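
The selection logic in that branch can be tried in isolation with a sketch
like the one below. Everything outside the quoted snippet is an assumption:
BELOW_THRESHOLD is a simplified stand-in for the GMP macro, the two
threshold values are placeholders rather than tuned numbers, and
choose_toom with its enum exists only for this illustration.

```c
#include <stddef.h>

/* Simplified stand-in for GMP's BELOW_THRESHOLD macro. */
#define BELOW_THRESHOLD(size, thresh) ((size) < (thresh))

/* Placeholder values; real ones come from tuning. */
#define MUL_TOOM32_TO_TOOM43_THRESHOLD 100
#define MUL_TOOM43_TO_TOOM54_THRESHOLD 150

enum toom_choice { USE_OTHER, USE_TOOM32, USE_TOOM43, USE_TOOM54 };

/* Mirrors the branch structure above for operand sizes un >= vn. */
static enum toom_choice choose_toom(size_t un, size_t vn)
{
    if (2 * un < 3 * vn) {
        if (BELOW_THRESHOLD(vn, MUL_TOOM32_TO_TOOM43_THRESHOLD))
            return USE_TOOM32;
        else if (BELOW_THRESHOLD(vn, MUL_TOOM43_TO_TOOM54_THRESHOLD))
            return USE_TOOM43;
        else
            return USE_TOOM54;
    }
    return USE_OTHER;   /* other ratios are handled by other branches */
}
```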

-- 
Torbjörn

From nisse at lysator.liu.se  Wed Feb 15 10:13:06 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 15 Feb 2012 10:13:06 +0100
Subject: toom54
In-Reply-To: <86vcn9w7tn.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 14 Feb 2012 20:09:24 +0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	<nn4nuu1nb1.fsf@stalhein.lysator.liu.se>
	<86vcn9w7tn.fsf@shell.gmplib.org>
Message-ID: <nn8vk4zcgt.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Not so sure.  But we can decide to make it that.  It currently is
> fastest for large operands, in a *narrow* space between 43 and 53, and
> its subproducts will want 33.

If so, we have an apparent cycle in the graph,

          toom33
         ^      \
        /        \
       V          V
   toom32  <--->  toom22

For the range where toom32 is (or should be) used, its calls to toom33
shouldn't generate any calls to higher tooms(?). And when toom22 calls
toom32, it also shouldn't call toom33. Hence, the above subgraph is for
calls to toom32, while for a call to toom22, we only have

   toom32  <--->  toom22

And moving up, in the general case, I imagine toom33 might at least call
toom43? It's going to be a fairly complicated graph.

I'm also thinking about how to do the itch functions for these
functions, taking all recursive calls into account. I still think we
want closed formulas for the lowest tooms, while that's most likely
*not* practical for the higher ones.

> But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set
> to some default.

I've started to look at tuning. I don't yet understand how the
TOOMX_TO_TOOMY tuning works.

I started by adding MPN_TOOM54_MUL_MINSIZE, and then I noticed that the
corresponding values for toom43 and toom53 are commented with "???" in
gmp-impl.h. Is it correct to use the same value as MIN_AN in the
corresponding testfiles? If so, I think this is the right patch:

diff -r 605ce4a6238b gmp-impl.h
--- a/gmp-impl.h     Tue Feb 14 21:41:21 2012 +0100
+++ b/gmp-impl.h     Wed Feb 15 10:04:32 2012 +0100
@@ -1247,8 +1247,9 @@ typedef struct {
 
 #define MPN_TOOM32_MUL_MINSIZE   10
 #define MPN_TOOM42_MUL_MINSIZE   10
-#define MPN_TOOM43_MUL_MINSIZE   49 /* ??? */
-#define MPN_TOOM53_MUL_MINSIZE   49 /* ??? */
+#define MPN_TOOM43_MUL_MINSIZE   25
+#define MPN_TOOM53_MUL_MINSIZE   17
+#define MPN_TOOM54_MUL_MINSIZE   31
 #define MPN_TOOM63_MUL_MINSIZE   49
 
 #define MPN_TOOM42_MULMID_MINSIZE    4

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Wed Feb 15 15:08:54 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 15 Feb 2012 15:08:54 +0100
Subject: toom54
In-Reply-To: <86vcn9w7tn.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 14 Feb 2012 20:09:24 +0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	<nn4nuu1nb1.fsf@stalhein.lysator.liu.se>
	<86vcn9w7tn.fsf@shell.gmplib.org>
Message-ID: <nn4nusyyrt.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set to
> some default.

I have pushed in an attempt at tuning code. Please have a look to see if it
makes sense. I measure a line with an operand size ratio 5/6. On this
core2 machine I get

#define MUL_TOOM43_TO_TOOM54_THRESHOLD     100

I set the default to 150, since other tuned thresholds seem to be a bit
smaller than the defaults.

/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 16 07:54:02 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 16 Feb 2012 07:54:02 +0100
Subject: toom54
In-Reply-To: <nn4nusyyrt.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 15 Feb 2012 15\:08\:54
	+0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	<nn4nuu1nb1.fsf@stalhein.lysator.liu.se>
	<86vcn9w7tn.fsf@shell.gmplib.org>
	<nn4nusyyrt.fsf@stalhein.lysator.liu.se>
Message-ID: <86zkcjuv3p.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  > But first MUL_TOOM43_TO_TOOM54_THRESHOLD needs to be measured and set to
  > some default.
  
  I have pushed in an attempt at tuning code. Please have a look to see if it
  makes sense. I measure a line with an operand size ratio 5/6. On this
  core2 machine I get
  
  #define MUL_TOOM43_TO_TOOM54_THRESHOLD     100
  
  I set the default to 150, since other tuned thresholds seem to be a bit
  smaller than the defaults.
  
I haven't yet admired the code, but magically it is now tabled with the
other thresholds:

http://gmplib.org/devel/thresholds.html

-- 
Torbjörn

From nisse at lysator.liu.se  Thu Feb 16 09:42:23 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 16 Feb 2012 09:42:23 +0100
Subject: toom54
In-Reply-To: <86zkcjuv3p.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu, 16 Feb 2012 07:54:02 +0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	<nn4nuu1nb1.fsf@stalhein.lysator.liu.se>
	<86vcn9w7tn.fsf@shell.gmplib.org>
	<nn4nusyyrt.fsf@stalhein.lysator.liu.se>
	<86zkcjuv3p.fsf@shell.gmplib.org>
Message-ID: <nnipj7xj80.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> I haven't yet admired the code, but magically it is now tabled with the
> other thresholds:
>
> http://gmplib.org/devel/thresholds.html

Nice. Median 124, compared to 85 for TOOM32_TO_TOOM43, and 96 for
TOOM42_TO_TOOM63.

/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Thu Feb 16 12:03:04 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 16 Feb 2012 12:03:04 +0100
Subject: Status update: mini-gmp
In-Reply-To: <86k44qnrzk.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 17 Jan 2012 18:44:47 +0100")
References: <nn62h0dbs5.fsf@stalhein.lysator.liu.se>
	<nn62gby8lr.fsf@stalhein.lysator.liu.se>
	<86k44qnrzk.fsf@shell.gmplib.org>
Message-ID: <nnehtvxcpj.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> It would be nicer to have it in a separate directory, both for GMP
> directory cleanness and for easy extraction by users.

Right, I guess that makes the most sense.

After copying the mini-gmp directory, I can build gmp with the attached
patch. For the functions in dumbmp.c:

* Most aren't needed any more.

* Functions used in only one file are copied to that file.

* The remaining few functions are moved to a new file, named
  mini-gmp-extra.c, at the gmp top-level.

Functions which get by with just mini-gmp include mini-gmp/mini-gmp.c,
the rest instead include mini-gmp-extra.c.

I did some other minor changes to the files: Use assert rather than
ASSERT. Use memmove rather than mem_copyi. Deleted casts of the return
value from xmalloc, instead expecting the return value to be of type
void *. In mini-gmp-extra.c, I defined xmalloc as an alias for
gmp_default_xalloc, an internal function in mini-gmp.c, rather than
including yet another copy of the same thing.

I'm not sure how to best do the automakery to get mini-gmp included in
the gmp distribution. Is it good enough to just add the mini-gmp
directory to EXTRA_DIST? Or do we need a list or glob pattern for the
wanted files somewhere? I'd like to keep backup files, build products
etc. in the mini-gmp directory from being picked up by accident, and
I'm not sure how smart the automake rules are for directories listed in
EXTRA_DIST.

> From GMP's perspective, I don't think mini-gmp unit testing should be
> necessary.

I see. But it would be nice with a top-level make target check-mini-gmp,
to build and test mini-gmp with the same configuration (compiler,
builddir, etc) as set up for gmp.

Regards,
/Niels

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mini-gmp.patch.2
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20120216/9ce351cf/attachment.ksh>
-------------- next part --------------

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 16 12:11:19 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 16 Feb 2012 12:11:19 +0100
Subject: Status update: mini-gmp
In-Reply-To: <nnehtvxcpj.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Thu\, 16 Feb 2012 12\:03\:04
	+0100")
References: <nn62h0dbs5.fsf@stalhein.lysator.liu.se>
	<nn62gby8lr.fsf@stalhein.lysator.liu.se>
	<86k44qnrzk.fsf@shell.gmplib.org>
	<nnehtvxcpj.fsf@stalhein.lysator.liu.se>
Message-ID: <86ipj7hw2w.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  After copying the mini-gmp directory, I can build gmp with the attached
  patch. For the functions in dumbmp.c:
  
Nice!

  * Most aren't needed any more.
  
  * Functions used in only one file are copied to that file.
  
  * The remaining few functions are moved to a new file, named
    mini-gmp-extra.c, at the gmp top-level.
  
Perhaps a better name would be bootstrap.c?

  Functions which get by with just mini-gmp include mini-gmp/mini-gmp.c,
  the rest instead include mini-gmp-extra.c.
  
I suppose that's not a truly necessary tweak.  Using the strategy of
including bootstrap.c/mini-gmp-extra.c in all files might be a bit
cleaner.

  I did some other minor changes to the files: Use assert rather than
  ASSERT. Use memmove rather than mem_copyi. Deleted casts of the return
  value from xmalloc, instead expecting the return value to be of type
  void *. In mini-gmp-extra.c, I defined xmalloc as an alias for
  gmp_default_xalloc, an internal function in mini-gmp.c, rather than
  including yet another copy of the same thing.
  
OK.

  I'm not sure how to best do the automakery to get mini-gmp included in
  the gmp distribution. Is it good enough to just add the mini-gmp
  directory to EXTRA_DIST? Or do we need a list or glob pattern for the
  wanted files somewhere?

I don't recall; I do this too seldom. Please make some experiments.

  > From GMP's perspective, I don't think mini-gmp unit testing should be
  > necessary.
  
  I see. But it would be nice with a top-level make target check-mini-gmp,
  to build and test mini-gmp with the same configuration (compiler,
  builddir, etc) as set up for gmp.
  
Makes sense.


-- 
Torbjörn

From nisse at lysator.liu.se  Thu Feb 16 12:23:33 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 16 Feb 2012 12:23:33 +0100
Subject: Status update: mini-gmp
In-Reply-To: <86ipj7hw2w.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu, 16 Feb 2012 12:11:19 +0100")
References: <nn62h0dbs5.fsf@stalhein.lysator.liu.se>
	<nn62gby8lr.fsf@stalhein.lysator.liu.se>
	<86k44qnrzk.fsf@shell.gmplib.org>
	<nnehtvxcpj.fsf@stalhein.lysator.liu.se>
	<86ipj7hw2w.fsf@shell.gmplib.org>
Message-ID: <nnaa4jxbre.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Perhaps a better name would be bootstrap.c?

Makes some sense. At least it's shorter.

> I suppose that's not a truly necessary tweak.  Using the strategy of
> including bootstrap.c/mini-gmp-extra.c in all files might be a bit
> cleaner.

Depends on whether or not we aim to eliminate that extra file and only
use mini-gmp.c. If not, then I agree it's a bit cleaner to include the
bootstrap.c file everywhere, and then we can also be less zealous about
moving definitions out from that file.

And one correction, I wrote:

>   In mini-gmp-extra.c, I defined xmalloc as an alias for
>   gmp_default_xalloc, an internal function in mini-gmp.c, rather than
>   including yet another copy of the same thing.

I think that's a sensible thing to do, but that change wasn't actually
included in the posted patch.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 16 18:06:53 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 16 Feb 2012 18:06:53 +0100
Subject: toom54
In-Reply-To: <nn4nusyyrt.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 15 Feb 2012 15\:08\:54
	+0100")
References: <nnhayu2acs.fsf@stalhein.lysator.liu.se>
	<86lio6zz44.fsf@shell.gmplib.org>
	<nnd39i284g.fsf@stalhein.lysator.liu.se>
	<49377.151.32.244.160.1329151450.squirrel@mail.dm.unipi.it>
	<nn4nuu1nb1.fsf@stalhein.lysator.liu.se>
	<86vcn9w7tn.fsf@shell.gmplib.org>
	<nn4nusyyrt.fsf@stalhein.lysator.liu.se>
Message-ID: <86ty2q672q.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I have pushed in an attempt at tuning code. Please have a look to see if it
  makes sense. I measure a line with an operand size ratio 5/6. On this
  core2 machine I get
  
It looks similar to my old code, which I no longer understand, so it
must be right.  :-)

-- 
Torbjörn

From marc.glisse at inria.fr  Thu Feb 16 22:24:11 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Thu, 16 Feb 2012 22:24:11 +0100 (CET)
Subject: g++-3.4 bug
Message-ID: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>

Hello,

some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to 
a bug in g++-3.4, which for l=2 says the following is true:
__builtin_constant_p(l) && (l == 0)
it is interesting to insert a printf statement that prints both l and l==0 
and have it print 2 and true :-/

More recent versions like g++42 seem fine. There are various simple 
workarounds that make g++34 happy (although in principle they shouldn't 
change anything), but that seems a bit dangerous (who knows where the bug 
might be lurking?). My current plan is to replace the test for the 
existence of __builtin_constant_p from __GNUC__ >= 2 to 
__GMP_GNUC_PREREQ(4, 2) (or possibly 4.0 or 4.1 if I can get access to 
either to check if the bug is present). This would only affect the use of 
__builtin_constant_p in gmpxx.h (not anywhere else, in particular not 
gmp.h), which is new in gmp-5.1.
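
A minimal sketch of such a guard, written in plain C for illustration. The
macro names TOY_GNUC_PREREQ and TOY_CONSTANT_P and the function mul_si are
hypothetical and are not copied from gmp.h or gmpxx.h; the 4.2 cutoff is the
tentative one from the plan above.

```c
/* Hypothetical version test in the style of a GNUC prerequisite macro. */
#if defined(__GNUC__) && defined(__GNUC_MINOR__)
#define TOY_GNUC_PREREQ(maj, min) \
    ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
#else
#define TOY_GNUC_PREREQ(maj, min) 0
#endif

#if TOY_GNUC_PREREQ(4, 2)
#define TOY_CONSTANT_P(x) __builtin_constant_p(x)
#else
#define TOY_CONSTANT_P(x) 0   /* never take the shortcut on older compilers */
#endif

/* The shortcut may only change how the result is computed, never the
   result itself; with the bug described above, l == 2 would wrongly
   take the l == 0 branch. */
static long mul_si(long x, long l)
{
    if (TOY_CONSTANT_P(l) && l == 0)
        return 0;
    return x * l;
}
```

With the fallback definition the condition is a compile-time 0, so older
compilers simply always use the general path.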

Does that sound ok?

-- 
Marc Glisse

From nisse at lysator.liu.se  Fri Feb 17 11:29:09 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 17 Feb 2012 11:29:09 +0100
Subject: mini-gmp checkin?
Message-ID: <nnlio1wy6i.fsf@stalhein.lysator.liu.se>

I've had a look at the dist machinery. Adding a directory to EXTRA_DIST
copies the directory and *all* files. Then GMP uses dist-hook to clean
up a bit. I ended up adding a couple of the files to EXTRA_DIST, and
then a line in dist-hook to also copy mini-gmp/tests/*.[ch]. Seems to
work fine, make dist appears to pick up the right files, and make
distcheck works.

Complete patch is rather large, so I put it at
http://www.lysator.liu.se/~nisse/misc/mini-gmp.patch3

I settled for the name bootstrap.c. Otherwise the changes are about the
same as in the previous patch.

I'm about to check this in. Ok?

regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


From bodrato at mail.dm.unipi.it  Sat Feb 18 17:55:57 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Sat, 18 Feb 2012 17:55:57 +0100 (CET)
Subject: mini-gmp checkin?
In-Reply-To: <nnlio1wy6i.fsf@stalhein.lysator.liu.se>
References: <nnlio1wy6i.fsf@stalhein.lysator.liu.se>
Message-ID: <49248.151.32.167.136.1329584157.squirrel@mail.dm.unipi.it>

Ciao,

On Fri, 17 February 2012 11:29 am, Niels Möller wrote:
> I've had a look at the dist machinery. Adding a directory to EXTRA_DIST
> copies the directory and *all* files. Then GMP uses dist-hook to clean
> up a bit. I ended up adding a couple of the files to EXTRA_DIST, and
> then a line in dist-hook to also copy mini-gmp/tests/*.[ch]. Seems to

Sounds a bit tricky to me, but it probably is the cleanest way if we want
to avoid recursive Makefiles in mini-gmp/

> I'm about to check this in. Ok?

Some changes may be needed to mini-gmp, but they can be delayed after the
first check-in.

E.g. mpn_invert_3by2 and mpn_invert_limb are defined in mini-gmp.h, but
they probably should not be.

Regards,
m

-- 
http://bodrato.it/


From nisse at lysator.liu.se  Mon Feb 20 10:19:03 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 20 Feb 2012 10:19:03 +0100
Subject: About mpn_sqr_basecase
Message-ID: <nn4nulx3p4.fsf@stalhein.lysator.liu.se>

I've been looking a bit at mpn_sqr_basecase.

1. It uses a temporary array, apparently for systems lacking an
   mpn_sqr_diag_addlsh1 which can do in-place operation. I think one can
   support arbitrary sizes with a small temporary area if one collects
   the off-diagonal terms into the result area, and then computes the
   diagonal products blockwise, a hundred limbs at a time or so.

   Current code uses the opposite allocation, with diagonal terms in
   the result area and the off-diagonal terms in the temporary array.

2. One could do the shifting differently, applying the shift to the limb
   argument of addmul_1. Something like, when doing the off-diagonal
   products for up[i],

     mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1,
		   (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1)));

   Might be cheaper, if we can get this shifting done in parallel with
   other operations, and get a simpler carry propagation recurrence when
   adding diagonal and off-diagonal terms together.
   
3. The comments on using addmul_2 say that it is tricky. I think that's
   because the diagonal terms are still 1x1 products. That would get
   simpler if one treats double-limbs as the units everywhere, having
   the diagonal computation form the 2x2 products

     (u0, u1)^2, (u2, u3)^2, ...

   The total number of limb products ought to be the same, if we do each
   of these terms as

     (u0 + u1 B)^2 = u0^2 + B^2 u1^2 + 2B (u0 * u1)
     	           = u0^2 + B^2 u1^2 + B (u0 << 1) * u1 + B^2 u1 & HIGH_BIT_TO_MASK(u0)

   The off-diagonal terms, to be computed with addmul_2, are then

     (u0, u1) * (u2, u3, ...)
             (u2, u3) * (u4, u5, ...)
             ...

   I guess one can also collect the close-to-diagonal terms B u0 u1 +
   B^5 u2 u3 + ..., together with the other off-diagonal terms.
   One would then get the off-diagonal sum

     B u0*u1 + B^2 (u0, u1) * (u2, u3, ...)
       B^5 u2*u3 + B^6 (u2, u3) * (u4, u5, ...)

   which begs for an mpn_addmul_1c accepting an additional carry input
   limb. Is there such a function?

4. There's code to use mpn_addmul_2s. What is that function supposed to
   do, is it doing the above sum?
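
The identity in item 3 can be checked with a small portable C sketch.
It assumes 16-bit limbs (B = 2^16) so that the whole square fits in a
uint64_t; the function name sqr_2limb_sketch is hypothetical, and the
HIGH_BIT_TO_MASK step is written out by hand rather than taken from GMP.

```c
#include <stdint.h>

/* Square the two-limb number u0 + u1*B, B = 2^16, using
   u0^2 + B^2 u1^2 + B ((u0 << 1) mod B) u1 + B^2 (u1 & mask),
   where mask is all ones when the high bit of u0 is set. */
uint64_t sqr_2limb_sketch(uint16_t u0, uint16_t u1)
{
  const uint64_t B = (uint64_t) 1 << 16;
  uint16_t u0s = (uint16_t) (u0 << 1);      /* (u0 << 1) mod B */
  uint16_t mask = (uint16_t) -(u0 >> 15);   /* 0xFFFF or 0 */
  uint64_t r;

  r = (uint64_t) u0 * u0;                   /* u0^2 */
  r += B * B * ((uint64_t) u1 * u1);        /* B^2 u1^2 */
  r += B * ((uint64_t) u0s * u1);           /* B (u0 << 1) * u1 */
  r += B * B * (uint64_t) (u1 & mask);      /* bit shifted out of u0 */
  return r;
}
```

The same three multiplications are used as in the plain expansion; only
the doubling has moved from the product to the operand.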

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


From tg at gmplib.org  Mon Feb 20 10:40:38 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 10:40:38 +0100
Subject: About mpn_sqr_basecase
In-Reply-To: <nn4nulx3p4.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Mon\, 20 Feb 2012 10\:19\:03
	+0100")
References: <nn4nulx3p4.fsf@stalhein.lysator.liu.se>
Message-ID: <86linxc06h.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  1. It uses a temporary array, apparently for systems lacking an
     mpn_sqr_diag_addlsh1 which can do in-place operation. I think one can
     support arbitrary sizes with a small temporary area if one collects
     the off-diagonal terms into the result area, and then computes the
     diagonal products blockwise, a hundred limbs at a time or so.
  
Why would one want to support such large sizes?

     Current code uses the opposite allocation, with diagonal terms in
     the result area and the off-diagonal terms in the temporary array.
  
Old x86 code (p6, k6, k7) might do it like that.  The 9 other assembly
files for sqr_basecase handle arbitrary operands.

  2. One could do the shifting differently, applying the shift to the limb
     argument of addmul_1. Something like, when doing the off-diagonal
     products for up[i],
  
       mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1,
  		   (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1)));
  
     Might be cheaper, if we can get this shifting done in parallel with
     other operations, and get a simpler carry propagation recurrency when
     adding diagonal and off-diagonal terms together.
     
And then handle carry-out from the last shift as a conditional add_n.

  3. The comments on using addmul_2 say that it is tricky. I think that's
     because the diagonal terms are still 1x1 products. That would get
     simpler if one treats double-limbs as the units everywhere, having
     the diagonal computation form the 2x2 products
  
       (u0, u1)^2, (u2, u3)^2, ...
  
     The total number of limb products ought to be the same, if we do each
     of these terms as
  
       (u0 + u1 B)^2 = u0^2 + B^2 u1^2 + 2B (u0 * u1)
       	           = u0^2 + B^2 u1^2 + B (u0 << 1) * u1 + B^2 u1 & HIGH_BIT_TO_MASK(u0)
  
     The off-diagonal terms, to be computed with addmul_2, are then
  
       (u0, u1) * (u2, u3, ...)
               (u2, u3) * (u4, u5, ...)
               ...
  
     I guess one can also collect the close-to-diagonal terms B u0 u1 +
     B^5 u2 u3 + ..., together with the other off-diagonal terms,
     One would then get the off-diagonal sum
  
       B u0*u1 + B^2 (u0, u1) * (u2, u3, ...)
         B^5 u2*u3 + B^6 (u2, u3) * (u4, u5, ...)
  
     which begs for an mpn_addmul_1c accepting an additional carry input
     limb. Is there such a function?
  
I beg to differ about the greatness of this approach.  It might make
some parts of the code look simpler, but will it be faster?  The C
sqr_basecase code is a bit hairy, but mainly because of the many
variants, but the many variants are there for best performance
everywhere.

  4. There's code to use mpn_addmul_2s. What is that function supposed to
     do, is it doing the above sum?
  
No.  It is an addmul_2 which suppresses a final mul to avoid including
the diagonal product.  I don't think any processor explicitly provides
it (although it is mentioned in several assembly files).  I have
explicit such sparc64 code in a repo someplace.

-- 
Torbjörn

From rguenther at suse.de  Mon Feb 20 10:46:01 2012
From: rguenther at suse.de (Richard Guenther)
Date: Mon, 20 Feb 2012 10:46:01 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
Message-ID: <alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>

On Thu, 16 Feb 2012, Marc Glisse wrote:

> Hello,
> 
> some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to a
> bug in g++-3.4, which for l=2 says the following is true:
> __builtin_constant_p(l) && (l == 0)
> it is interesting to insert a printf statement that prints both l and l==0 and
> have it print 2 and true :-/

Not for me.

int main () { int l = 2; if (__builtin_constant_p (l) && (l == 0)) return 
1; return 0; }

Richard.

From nisse at lysator.liu.se  Mon Feb 20 11:59:55 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 20 Feb 2012 11:59:55 +0100
Subject: About mpn_sqr_basecase
In-Reply-To: <86linxc06h.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Mon, 20 Feb 2012 10:40:38 +0100")
References: <nn4nulx3p4.fsf@stalhein.lysator.liu.se>
	<86linxc06h.fsf@shell.gmplib.org>
Message-ID: <nnzkcdvkgk.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Why would one want to support such large sizes?

It would be nice to get rid of the SQR_TOOM2_THRESHOLD size restriction
in *all* squaring code, since that restriction makes things a bit
brittle and causes additional complexity for the fat case. I see no
good reason besides that.

Switching the use of the allocation areas seems like a simple way to get
rid of that. I've been looking mostly at the C sqr_basecase.

>   2. One could do the shifting differently, applying the shift to the limb
>      argument of addmul_1. Something like, when doing the off-diagonal
>      products for up[i],
>   
>        mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1,
>   		   (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1)));
>   
>      Might be cheaper, if we can get this shifting done in parallel with
>      other operations, and get a simpler carry propagation recurrency when
>      adding diagonal and off-diagonal terms together.
>      
> And then handle carry-out from the last shift as a conditional add_n.

It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
shift up[n-2], we only need a conditional add of the single limb
up[n-1].

[On use of addmul_2]:

> I beg to differ about the greatness of this approach.

Care to elaborate on why you expect it to be too slow? I imagine carry
handling for the close-to-diagonal terms up[2k] * up[2k+1] will be slow
without assembler support. Or should we postpone this discussion until
there's some code to compare?

BTW, what do you think about the mpn_addmul_1c entrypoint? Would it make
sense with addmul_2c as well? addmul_1c is declared in gmp-impl.h, and
it seems it's implemented on some x86_32 configurations and on
powerpc64.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Mon Feb 20 12:12:22 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 12:12:22 +0100
Subject: Sandybridge addmul_N challenge
Message-ID: <868vjxbvxl.fsf@shell.gmplib.org>

The two high-end architectures for GMP are AMD K8-K10 (i.e., all Opteron
except 62xx, Athlon 64, Athlon X2, Phenom, Phenom II) and Intel
Sandybridge (i.e., socket 1155 and 2011 Core i3,i5,i7).

We have great multiplication loops for K8-K10, addmul_1 runs at 2.5 c/l
and addmul_2 runs at 2.375 c/l.  (These loops are then used in
mul_basecase, sqr_basecase, redc_1, redc_2, and a few other places.)

But our multiplication loops for Sandybridge are much worse.  The
current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l.  I
have new code running at 3.4 c/l and 3.3 c/l respectively.

The critical instructions for these loops are MUL and ADC.  The
throughput of MUL is great for both CPUs (actually better on Sandybridge
than K8-K10).  ADC is more tricky; AMD can issue 3 per cycle with a
latency of 1 cycle, but for Intel the situation is trickier:

In all cases the carry-out has a latency of 1 cycle.  For "ADC $0,dreg"
the latency of dreg is one cycle, but for "ADC sreg,dreg" it is two
cycles.  (It is also 2 cycles for "ADC $const,dreg" when const != 0.)

The challenge is to beat 3 c/l with either addmul_1 or addmul_2.

Success will boost GMP's general performance on these processors, since
every higher-level operation depends on these lowest-level multiply
primitives.

-- 
Torbjörn

From foxmuldrster at yahoo.com  Mon Feb 20 12:19:32 2012
From: foxmuldrster at yahoo.com (Rick Hodgin)
Date: Mon, 20 Feb 2012 03:19:32 -0800 (PST)
Subject: Sandybridge addmul_N challenge
Message-ID: <1329736772.87971.androidMobile@web125402.mail.ne1.yahoo.com>

What source file and line?

Best regards,
Rick C. Hodgin


From marc.glisse at inria.fr  Mon Feb 20 20:39:43 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Mon, 20 Feb 2012 20:39:43 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To: <alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
Message-ID: <alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>

On Mon, 20 Feb 2012, Richard Guenther wrote:

> On Thu, 16 Feb 2012, Marc Glisse wrote:
>
>> Hello,
>>
>> some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to a
>> bug in g++-3.4, which for l=2 says the following is true:
>> __builtin_constant_p(l) && (l == 0)
>> it is interesting to insert a printf statement that prints both l and l==0 and
>> have it print 2 and true :-/
>
> Not for me.
>
> int main () { int l = 2; if (__builtin_constant_p (l) && (l == 0)) return
> 1; return 0; }

Did you try getting a snapshot of the gmp repos and running the c++ 
testsuite? Yes, your simple example passes, but you know better than I do 
how much the context matters for optimizations. And since the bug doesn't 
seem reproducible with more recent versions of gcc, there is little 
motivation to reduce the failing tests.

The failing compiler was 3.4.6, from ports on freebsd 8.1, with -O2 -m64. 
4.2.5 seems good.

In gmpxx.h, you can do this change:

struct __gmp_binary_lshift // LINE 425
{
   static void eval(mpz_ptr z, mpz_srcptr w, mp_bitcnt_t l)
   {
     if (__GMPXX_CONSTANT(l) && (l == 0))
     {
       std::cerr << l << '\n'; // NEW LINE
       if (z != w) mpz_set(z, w);
     }

(and replace <iosfwd> with <iostream> at the beginning)
and wonder why it prints "2"...


-- 
Marc Glisse

From rguenther at suse.de  Mon Feb 20 20:48:41 2012
From: rguenther at suse.de (Richard Guenther)
Date: Mon, 20 Feb 2012 20:48:41 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To: <alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>
Message-ID: <alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr>

On Mon, 20 Feb 2012, Marc Glisse wrote:

> On Mon, 20 Feb 2012, Richard Guenther wrote:
> 
> > On Thu, 16 Feb 2012, Marc Glisse wrote:
> > 
> > > Hello,
> > > 
> > > some tests currently fail on gmpxx with g++-3.4 (on shell). This is due to
> > > a
> > > bug in g++-3.4, which for l=2 says the following is true:
> > > __builtin_constant_p(l) && (l == 0)
> > > it is interesting to insert a printf statement that prints both l and l==0
> > > and
> > > have it print 2 and true :-/
> > 
> > Not for me.
> > 
> > int main () { int l = 2; if (__builtin_constant_p (l) && (l == 0)) return
> > 1; return 0; }
> 
> Did you try getting a snapshot of the gmp repos and running the c++ testsuite?
> Yes, your simple example passes, but you know better than I do how much the
> context matters for optimizations. And since the bug doesn't seem reproducible
> with more recent versions of gcc, there is little motivation to reduce the
> failing tests.

Ah, ok - I thought you might have one.  I'm not really interested in
GCC 3.4.x bugs either - after all this version has been out of 
maintenance for six years...

> The failing compiler was 3.4.6, from ports on freebsd 8.1, with -O2 -m64.
> 4.2.5 seems good.

... as have 4.2.x and 4.3.x.  But it seems freebsd is stuck with 4.2.2,
the last release with GPLv2.  I suppose for freebsd testing should focus
on LLVM.

Richard.

From tg at gmplib.org  Mon Feb 20 20:59:00 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 20:59:00 +0100
Subject: g++-3.4 bug
In-Reply-To: <alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr> (Richard
	Guenther's message of "Mon\,
	20 Feb 2012 20\:48\:41 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr>
Message-ID: <86ehtp46pn.fsf@shell.gmplib.org>

Richard Guenther <rguenther at suse.de> writes:

  ... as have 4.2.x and 4.3.x.  But it seems freebsd is stuck with 4.2.2,
  the last release with GPLv2.  I suppose for freebsd testing should focus
  on LLVM.
  
I think differently.  I am developing GMP on FreeBSD systems, and use
gcc there.  I don't waste any time on LLVM.

It is a pity they make this play about GPLv3, but I don't think they'll
get out of their block for many years.  In the meantime, one may always
install things from /usr/ports/lang/gcc*.

-- 
Torbjörn

From marc.glisse at inria.fr  Mon Feb 20 21:30:38 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Mon, 20 Feb 2012 21:30:38 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To: <alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr>
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr>
Message-ID: <alpine.DEB.2.02.1202202113040.3110@laptop-mg.saclay.inria.fr>

On Mon, 20 Feb 2012, Richard Guenther wrote:

> Ah, ok - I thought you might have one.  I'm not really interested in
> GCC 3.4.x bugs either - after all this version has been out of
> maintenance for six years...

Yes, my main concern is whether I should let people notice that the 
testsuite is failing so they try a more recent compiler, or work around it 
by disabling the use of __builtin_constant_p for 3.4.* (and anything 
older?).
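
A version-guarded workaround of the kind mentioned above could look
roughly like the following C sketch. The macro name SKETCH_CONSTANT and
the cutoff at GCC 4 are assumptions made for illustration; they are not
taken from gmpxx.h, and the right cutoff would have to be determined by
testing.

```c
/* Hypothetical guard macro (not the gmpxx.h one): only trust
   __builtin_constant_p on GCC-compatible compilers from major
   version 4 onwards.  The cutoff is an assumption. */
#if defined(__GNUC__) && (__GNUC__ >= 4)
#define SKETCH_CONSTANT(x) __builtin_constant_p(x)
#else
#define SKETCH_CONSTANT(x) 0   /* never claim a compile-time constant */
#endif

/* Nonzero only when the compiler can prove l is the constant 0.
   It may return 0 even for l == 0, which merely skips the shortcut. */
int shift_is_known_zero(unsigned long l)
{
  return SKETCH_CONSTANT(l) && (l == 0);
}
```

With the guard evaluating to 0, the general code path is always taken,
so the result stays correct and only the constant-folding shortcut is
lost on the affected compilers.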

> ... as have 4.2.x and 4.3.x.  But it seems freebsd is stuck with 4.2.2,
> the last release with GPLv2.  I suppose for freebsd testing should focus
> on LLVM.

which last I checked didn't work with the master repos ;-)
(it doesn't understand the instruction jb,pt in mpn/x86_64/mod_34lsub1.asm 
(and neither does oracle))

Note that people who accept gplv2 but not gplv3 are fairly likely to be 
unhappy with gmp anyway...

-- 
Marc Glisse

From tg at gmplib.org  Mon Feb 20 21:49:53 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 20 Feb 2012 21:49:53 +0100
Subject: g++-3.4 bug
In-Reply-To: <alpine.DEB.2.02.1202202113040.3110@laptop-mg.saclay.inria.fr>
	(Marc Glisse's message of "Mon\,
	20 Feb 2012 21\:30\:38 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202113040.3110@laptop-mg.saclay.inria.fr>
Message-ID: <86aa4d44cu.fsf@shell.gmplib.org>

Marc Glisse <marc.glisse at inria.fr> writes:

  Yes, my main concern is whether I should let people notice that the
  testsuite is failing so they try a more recent compiler, or work
  around it by disabling the use of __builtin_constant_p for 3.4.* (and
  anything older?).
  
If just the test suite is miscompiled, and the compiler is actually
still used, then we might as well make a (trivial) workaround in the
test suite.

Note that we resisted the temptation to work around the GCC 4.3.2 bug
that made mpn/generic/rootrem.c become miscompiled.  In this case, we
had a compiler which was very much used, but no reasonable workaround
was found.

  which last I checked didn't work with the master repos ;-)
  (it doesn't understand the instruction jb,pt in
  mpn/x86_64/mod_34lsub1.asm (and neither does oracle))
  
I suppose we could as well remove it.  Now done.

-- 
Torbjörn

From marc.glisse at inria.fr  Tue Feb 21 21:31:00 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Tue, 21 Feb 2012 21:31:00 +0100 (CET)
Subject: g++-3.4 bug
In-Reply-To: <86aa4d44cu.fsf@shell.gmplib.org>
References: <alpine.DEB.2.02.1202162206320.2560@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202201043430.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202018400.3110@laptop-mg.saclay.inria.fr>
	<alpine.LNX.2.00.1202202045480.4999@zhemvz.fhfr.qr>
	<alpine.DEB.2.02.1202202113040.3110@laptop-mg.saclay.inria.fr>
	<86aa4d44cu.fsf@shell.gmplib.org>
Message-ID: <alpine.DEB.2.02.1202212111070.3574@laptop-mg.saclay.inria.fr>

On Mon, 20 Feb 2012, Torbjorn Granlund wrote:

> Marc Glisse <marc.glisse at inria.fr> writes:
>
>  Yes, my main concern is whether I should let people notice that the
>  testsuite is failing so they try a more recent compiler, or work
>  around it by disabling the use of __builtin_constant_p for 3.4.* (and
>  anything older?).
>
> If just the test suite is miscompiled,

libgmp.* and libgmpxx.* seem fine.

> and the compiler is actually still used, then we might as well make a 
> (trivial) workaround in the test suite.

I am not sure what you mean. Note that as is, a g++34 user who multiplies 
a mpz_class by 4u (what the testsuite does) in his code is likely to hit 
the bug. I am not really happy with hiding the testsuite failure, I'd 
rather either let the testsuite noisily fail, or (trivially) work around 
the bug in gmpxx.h so the user's code is safe too.

(now that I think about it, there is a testsuite failure on solaris 
(likely with g++-3.4.3) visible at http://hydra.nixos.org/build/2112917 
when creating a mpz_class from an mpz_t, 3.4 is really an unlucky number)

>  which last I checked didn't work with the master repos ;-)
>  (it doesn't understand the instruction jb,pt in
>  mpn/x86_64/mod_34lsub1.asm (and neither does oracle))
>
> I suppose we could as well remove it.  Now done.

Thanks.

-- 
Marc Glisse

From nisse at lysator.liu.se  Tue Feb 21 21:52:36 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 21 Feb 2012 21:52:36 +0100
Subject: About mpn_sqr_basecase
In-Reply-To: <nnzkcdvkgk.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?=
	message of "Mon, 20 Feb 2012 11:59:55 +0100")
References: <nn4nulx3p4.fsf@stalhein.lysator.liu.se>
	<86linxc06h.fsf@shell.gmplib.org>
	<nnzkcdvkgk.fsf@stalhein.lysator.liu.se>
Message-ID: <nnehtnvrhn.fsf@stalhein.lysator.liu.se>

I wrote:

>>   2. One could do the shifting differently, applying the shift to the limb
>>      argument of addmul_1.

Torbjörn:

>> And then handle carry-out from the last shift as a conditional add_n.

I:

> It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
> n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
> off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
> shift up[n-2], we only need a conditional add of the single limb
> up[n-1].

It turns out there are more conditional adds than I thought. When we
call addmul_1 with a shifted limb, the next addmul_1 is shorter. So
adding the carry to the next addmul_1 call is *almost* enough, but we
also have to apply it to the limb not included in that next addmul_1
call. It boils down to a conditional add to the next diagonal product.

Implementation below. It's quite nice and simple, a single loop which in
each iteration computes one diagonal term, calls addmul_1c for
off-diagonal terms (that's the bulk of the work, naturally), and then a
bit of extra fiddling to get the shifting correct. Whether this algorithm
can be faster than the current mpn_addmul_1-based code is less clear. It's
the O(n) work of MPN_SQR_DIAG_ADDLSH1 compared to the additional O(n)
work done for on-the-fly shifting.

One could hope that the additional shift and carry logic can be
scheduled in parallel with the multiplication work in (the previous)
call of addmul_1c. What's the bottleneck for addmul_1, is it multiplier
throughput, carry propagation latency, or decoder bandwidth?

Regards,
/Niels

#define sqr_2(r3, r2, r1, r0, u1, u0) do {				\
    mp_limb_t __p1, __p0, __t, __cy, __u1p;				\
    umul_ppmm (__t, (r0), (u0), (u0));					\
    umul_ppmm (__p1, __p0, (u0) << 1, (u1));				\
    __cy = (u0) >> (GMP_LIMB_BITS - 1);					\
    add_ssaaaa (__t, (r1), __p1, __p0, 0, __t);				\
    __u1p = (u1) + __cy;						\
    __cy = __u1p < __cy;						\
    umul_ppmm (__p1, __p0, (u1), __u1p);				\
    add_ssaaaa ((r3), (r2), __p1, __p0, -__cy, __t);			\
  } while (0)

/* Squaring with on-the-fly shifting for the off-diagonal elements.
   Let

     uc[i] = up[i] >> (GMP_LIMB_BITS - 1)
     us[0] = 2 up[0] mod B,
     us[i] = 2 up[i] mod B + uc[i-1], for i > 0

   In the first iteration, we compute

     up[0] * up[0] + B us[0] * <up[1], up[2], ..., up[n-1]>

   In iteration i, we add in

     B^{2i} (up[i] * (up[i] + uc[i-1]) + B us[i] * <up[i+1], ..., up[n-1]>)

   We have an unlikely carry from the addition up[i] + uc[i-1]. The
   current handling is a bit tricky. A simpler alternative is to
   compute the product up[i]^2, and conditionally add in up[i] to the
   result. */

void
sqr_basecase_1 (mp_ptr rp, mp_srcptr up, mp_size_t n)
{
  mp_size_t i;
  mp_limb_t ul, ulp, uh, p2, p1, p0, c1, c0, t;

  if (n == 1)
    {
      mp_limb_t ul = up[0];
      umul_ppmm (rp[1], rp[0], ul, ul);
      return;
    }
  else if (n == 2)
    {
      mp_limb_t u0, u1;
      u0 = up[0];
      u1 = up[1];
      sqr_2 (rp[3], rp[2], rp[1], rp[0], u1, u0);
      return;
    }

  ul = up[0];
  umul_ppmm (p1, rp[0], ul, ul);
  rp[n] = mpn_mul_1c (rp+1, up+1, n-1, ul<<1, p1);

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
			       (ul << 1) + c0, p1);
    }
  /* Handle i = n-2 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ul = up[n-2];
  ulp = ul + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, ul, ulp);
  add_ssaaaa (p1, rp[2*n-4], p1, p0, -c1, rp[2*n-4]);
  
  uh = up[n-1];
  umul_ppmm (p2, p0, (ul << 1) + c0, uh);
  ADDC_LIMB (c0, t, p1, rp[2*n-3]);
  add_ssaaaa (p2, rp[2*n-3], p2, p0, c0, t);
  
  /* Handle i = n-1 */
  c0 = ul >> (GMP_LIMB_BITS - 1);
  ulp = uh + c0;
  c1 = ulp < c0;
  umul_ppmm (p1, p0, uh, ulp);
  add_ssaaaa (rp[2*n-1], rp[2*n-2], p1, p0, -c1, p2);
}
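
The recurrence in the comment above can be checked independently of
GMP with a small model. The sketch below is an illustration only: it
assumes 8-bit limbs and at most four of them, so that the full square
fits in a uint64_t, and the name sqr_shifted_sketch is hypothetical.
It sums the terms exactly as the comment defines them, without any of
the carry handling the real code needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of squaring with on-the-fly shifting, B = 2^8, n <= 4.
   Iteration i adds
     B^{2i} (up[i] * (up[i] + uc[i-1]) + B us[i] * <up[i+1], ...>)
   with uc[i] = up[i] >> 7 and us[i] = (2 up[i] mod B) + uc[i-1]. */
uint64_t sqr_shifted_sketch(const uint8_t *up, size_t n)
{
  uint64_t r = 0;
  unsigned uc_prev = 0;          /* uc[i-1]; zero for i = 0 */
  size_t i, j;

  assert(n >= 1 && n <= 4);
  for (i = 0; i < n; i++)
    {
      unsigned us = ((up[i] << 1) & 0xFF) + uc_prev;  /* never overflows */
      uint64_t tail = 0;         /* <up[i+1], ..., up[n-1]> as a number */
      uint64_t term;

      for (j = n; j-- > i + 1; )
        tail = (tail << 8) | up[j];

      term = (uint64_t) up[i] * (up[i] + uc_prev);    /* diagonal part */
      term += ((uint64_t) us * tail) << 8;            /* off-diagonal part */
      r += term << (16 * i);
      uc_prev = up[i] >> 7;
    }
  return r;
}
```

For up = {0x80, 0x01} (the value 0x0180) the two iterations contribute
0x80 * 0x80 and 2 * B^2, which add up to 0x0180 squared.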

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Wed Feb 22 11:18:52 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 11:18:52 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <868vjxbvxl.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Mon, 20 Feb 2012 12:12:22 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
Message-ID: <nnaa4buq5v.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> But our multiplication loops for Sandybridge are much worse.  The
> current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l.  I
> have new code running at 3.4 c/l and 3.3 c/l respectively.

Let me think aloud for a bit.

I'm still not sure what the addmul_1 bottleneck is. I note that in the
general x86_64 and core2 code, for each mul, we do two *dependent* adds
before storing the low limb. With the cin and cout for the carry limbs,
the operations per limb are basically

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	%rax, cin
  adc	$0, cout
  add	cin, (rp)
  adc	%rdx, cout

The real code is a lot more clever with unrolling and scheduling, but if
I read it correctly, these are the operations done.

One could reorder this as follows,

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	(rp), cin
  adc	$0, cout
  add	%rax, cin
  adc	%rdx, cout
  mov	cin, (rp)

Then we get a bit more independence of operations, since the first add
is independent of the result of the mul. It costs an instruction with
the extra mov to rp. But I suspect that's still better than updating
memory twice, like

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	cin, (rp)
  adc	$0, cout
  add	%rax, (rp)
  adc	%rdx, cout

The recurrency depth seems to be the same in all cases, though, with a
latency of add + adc + adc from cin to cout. If that's what's killing
performance, maybe this would be better,

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	(rp), %rax
  adc	%rdx, cout
  add	%rax, cin
  adc	$0, cout
  mov	cin, (rp)

with a recurrency latency of only add + adc (where the adc in question
has a $0 source operand). If I understand you correctly, that would be
only two cycles on sandybridge. It seems the current sbr code does
something similar? Then we have to rely on scheduling or out-of-order
execution to not get a useless wait between the mul and the first add.
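
For reference, the dataflow of that last variant can be modelled in
portable C. This is only a sketch under assumptions: 32-bit limbs with
uint64_t products stand in for the 64-bit mul, and the function name
addmul_1_sketch is hypothetical. It says nothing about timing; it just
shows that the only loop-carried chain is "add cin" followed by
"adc $0".

```c
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
uint32_t addmul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                         uint32_t v)
{
  uint32_t cin = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      uint64_t p = (uint64_t) up[i] * v;   /* mul v */
      uint32_t lo = (uint32_t) p;          /* plays the role of %rax */
      uint32_t hi = (uint32_t) (p >> 32);  /* plays the role of %rdx */
      uint32_t t, cout;

      t = rp[i] + lo;                      /* add (rp), %rax */
      cout = hi + (t < lo);                /* adc %rdx, cout */
      t += cin;                            /* add %rax, cin */
      cout += (t < cin);                   /* adc $0, cout */
      rp[i] = t;                           /* mov cin, (rp) */
      cin = cout;                          /* carry limb for next round */
    }
  return cin;
}
```

The first add and adc depend only on the product and on memory, so
they are off the recurrence; cout cannot overflow because hi is at
most B-2 and the two carries cannot both push the sum past B-1.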

Is instruction issue still limited to three instructions per cycle? Then
that's a more narrow bottleneck than both latency and multiplication
throughput. The above is eight instructions per limb, excluding looping.
So one could at least *hope* to get down to 3 cycles with a moderate
level of unrolling.

Maybe one could try a variant updating (rp) twice, to save an
instruction:

  mov   (up), %rax
  mul	v
  xor	cout, cout
  add	%rax, (rp)
  adc	%rdx, cout
  add	cin, (rp)
  adc	$0, cout

Is it possible to squeeze those three memory instructions in less than 3
cycles?

If we try to get the instruction count below 9 (like above), then
there's not much room for moving around %rax and %rdx. But the critical
recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
leaving a bit of freedom to move it between iterations if we unroll and use
multiple registers for cin and cout. Hmm, let me give one more variant,
which moves %rax and %rdx out of the way, at the cost of one more
instruction (8 instructions per limb):

  mov   (up), %rax
  mul	v
  mov	%rax, pl
  mov	%rdx, cout
  add	pl, (rp)
  adc	$0, cout
  add	cin, (rp)
  adc	$0, cout

I guess one ought to bring out the loop mixer to find out if any of this
really can run at 3 cycles.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Wed Feb 22 16:30:55 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 16:30:55 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nnaa4buq5v.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 11\:18\:52
	+0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
Message-ID: <86sji2khqo.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  The recurrency depth seems to be the same in all cases, though, with a
  latency of add + adc + adc from cin to cout. If that's what's killing
  performance, maybe this would be better,
  
The x86_64/mul_1.asm code has adc+adc with register operands on the
recurrency chain, i.e., 2 cycles on any AMD chip and 4 cycles on any
Intel chip (except Pentium4 which has about 20 cycles).

    mov   (up), %rax
    mul	v
    xor	cout, cout
    add	(rp), %rax
    adc	%rdx, cout
    add	%rax, cin
    adc	$0, cout
    mov	cin, (rp)
  
  with a recurrency latency of only add + adc (where the adc in question
  has a $0 source operand). If I understand you correctly, that would be
  only two cycles on sandybridge. It seems the current sbr code does
  something similar? Then we have to rely on scheduling or out-of-order
  execution to not get a useless wait between the mul and the first add.
  
I think your code has indeed a 2c recurrency path.

It is similar to the sandybridge code, except that it turned out to be
better to use adc $0 in both places possible.

  Is instruction issue still limited to three instructions per cycle? Then
  that's a more narrow bottleneck than both latency and multiplication
  throughput. The above is eight instructions per limb, excluding looping.
  So one could at least *hope* to get down to 3 cycles with a moderate
  level of unrolling.

I don't think Intel's processors like "op mem,reg" very much.  I tried
the code above (with the loop mixer) and it runs slower than the code I
checked in (which runs slightly faster than claimed, 3.25 c/l, not 3.4
c/l).
  
We still have 3 insn/c, except that some adjacent insn pairs are fused.
I don't know exactly which, but test+jcc and cmp+jcc might be the only
ones.

My new addmul_1 has 38 insn, no fusing expected so 3.17 c/l is a decoder
imposed limit.
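
The 3.17 figure is plain arithmetic on the decode width. A tiny C
sketch of that calculation follows; the helper name decode_bound_cl is
hypothetical, and the four limbs per loop iteration is an assumption
inferred from the numbers (38 / 3 / 4 is about 3.17), not something
stated here.

```c
#include <assert.h>

/* Lower bound on cycles per limb from decode width alone:
   instructions per iteration / instructions decoded per cycle
   / limbs handled per iteration. */
double decode_bound_cl(int insns_per_iter, int decode_width,
                       int limbs_per_iter)
{
  assert(decode_width > 0 && limbs_per_iter > 0);
  return (double) insns_per_iter / decode_width / limbs_per_iter;
}
```

Under those assumptions decode_bound_cl (38, 3, 4) evaluates to roughly
3.17, so a 3.25 c/l loop is already close to what the decoder allows.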

  Maybe one could try a variant updating (rp) twice, to save an
  instruction:
  
    mov   (up), %rax
    mul	v
    xor	cout, cout
    add	%rax, (rp)
    adc	%rdx, cout
    add	cin, (rp)
    adc	$0, cout
  
  Is it possible to squeeze those three memory instructions in less than 3
  cycles?
  
Intel processors like "op reg,mem" even less...

  If we try to get the instruction count below 9 (like above), then
  there's not much room for moving around %rax and %rdx. But the critical
  recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
  leaving a bit of freedom to move it between iterations if we unroll and use
  multiple registers for cin and cout. Hmm, let me give one more variant,
  which moves %rax and %rdx out of the way, at the cost of one more
  instruction (8 instructions per limb):
  
    mov   (up), %rax
    mul	v
    mov	%rax, pl
    mov	%rdx, cout
    add	pl, (rp)
    adc	$0, cout
    add	cin, (rp)
    adc	$0, cout
  
  I guess one ought to bring out the loop mixer to find out if any of this
  really can run at 3 cycles.
  
The sandybridge machine tom behind shell is waiting for you.  A nehalem
machine is biko* (lots of Xen machines on the same hardware) and
repentium is a Conroe ("core2").

I recently played with karatsuba-based addmul_2 for s390.  For MUL-
challenged machines, that might be an option.  Sandybridge is OK there,
but Bulldozer is not.  A disadvantage is that such code would be
unsuitable for code which wants to be "side-channel silent".

-- 
Torbjörn

From tg at gmplib.org  Wed Feb 22 17:03:54 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 17:03:54 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nnaa4buq5v.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 11\:18\:52
	+0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
Message-ID: <86ipiykg7p.fsf@shell.gmplib.org>

I doubt we can make addmul_1 run faster on sandybridge.

But I'd like mul_basecase to run much faster than 3 c/l.  Then
sqr_basecase and redc_1, redc_2 should be fixed.

An addmul_2 running at 3 c/l or better would be great.  That
means we need to handle a "tick" in it using <= 17 insns, probably
avoiding "op reg,mem" and "op mem,reg".  (If we use 18 insns, loop
handling will bring us over 3 c/l.)

For mul_basecase and sqr_basecase we could perhaps work vertically,
summing into 3 registers.  I.e., pretend we really multiply polynomials,
performing no (recurrency) carry propagation until we reach the bottom.
I haven't tried this, but I think this might be really promising for
Intel's last *two* main generations (Nehalem/Westmere and
Sandybridge/Ivybridge).

Perhaps we could get close to 2 c/l with this approach, unless
register shortage messes things up too badly.

I haven't thought about doing redc_1/redc_2 using this approach.
Hensel lifting on-the-fly could be interesting...

-- 
Torbjörn

From nisse at lysator.liu.se  Wed Feb 22 19:16:23 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 19:16:23 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86sji2khqo.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Wed, 22 Feb 2012 16:30:55 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
Message-ID: <nn39a2vimg.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> I recently played with karatsuba-based addmul_2 for s390.  For MUL-
> challenged machines, that might be an option.  Sandybridge is OK there,
> but Bulldozer is not.  A disadvantage is that such code would be
> unsuitable for code which wants to be "side-channel silent".

I'm pretty sure one can do Karatsuba branchfree. Not sure one can do
that and do it fast (but if the alternative is branches, they're costly
too).

Evaluation:

  mov	u0, a
  sub	u1, a
  lea	(u0, u1), t
  cmovc	t, a
  sbb	mask, mask

Now a = |u0 - u1|, with sign in mask. Can easily be done also without
cmov, but with a longer chain of dependent instructions:

  mov	u0, a
  sub	u1, a
  sbb	mask, mask
  xor	mask, a
  sub	mask, a
  
Get b = |v0 - v1| in the same way, and arrange so that the final mask is
all ones if the term |u0 - u1| * |v0 - v1| should be subtracted, i.e., if
(u0 - u1)(v0 - v1) >= 0.
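
For reference, the cmov-free evaluation sequence above can be modelled in
C roughly as follows.  This is only a sketch for checking the arithmetic;
the helper name abs_diff_mask is made up here and is not part of GMP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GMP function): returns |u0 - u1| and stores
   an all-ones mask in *mask when u0 < u1, i.e. when the subtraction
   borrowed, and zero otherwise.  Models the sub / sbb / xor / sub chain. */
static uint64_t abs_diff_mask(uint64_t u0, uint64_t u1, uint64_t *mask)
{
    uint64_t a = u0 - u1;                /* sub: wraps modulo 2^64 */
    uint64_t m = -(uint64_t)(u0 < u1);   /* sbb mask, mask: 0 or all ones */
    assert(m == 0 || m == UINT64_MAX);
    *mask = m;
    return (a ^ m) - m;                  /* two's complement negate if m set */
}
```

With m all ones, (a ^ m) - m is ~a + 1 = -a, and with m zero it leaves a
unchanged, so no branch is needed.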

Interpolation:

Add in the signed term (u0 - u1) * (v0 - v1) to the three limbs
<r3,r2,r1>, using two's complement:

  mov	a, %rax
  mul	b
  xor	mask, %rax
  xor	mask, %rdx
  bt	$0, mask	C Set carry from mask. Any better way?
  C <mask, %rdx, %rax> + c = - (u0 - u1) (v0 - v1), in two's complement.
  adc	%rax, r1
  adc	%rdx, r2
  adc	mask, r3

Alternatively, one could use (u0 + u1) * (v0 + v1), with a couple of
conditional adds instead.
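
A C model of the interpolation step, assuming a compiler with unsigned
__int128 (GCC or Clang).  The name add_signed_product and the limb array
layout are illustrative assumptions only; r[0], r[1], r[2] play the roles
of r1, r2, r3 above.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Hypothetical helper: add +a*b (mask == 0) or -a*b (mask == all ones)
   to the three-limb value r[0..2], least significant limb first,
   working modulo 2^192. */
static void add_signed_product(uint64_t r[3], uint64_t a, uint64_t b,
                               uint64_t mask)
{
    assert(mask == 0 || mask == UINT64_MAX);
    u128 p = (u128)a * b;
    uint64_t lo = (uint64_t)p ^ mask;           /* complement low limb if negating */
    uint64_t hi = (uint64_t)(p >> 64) ^ mask;   /* complement high limb if negating */
    u128 t = (u128)r[0] + lo + (mask & 1);      /* carry-in 1 completes the negation */
    r[0] = (uint64_t)t;
    t = (u128)r[1] + hi + (uint64_t)(t >> 64);
    r[1] = (uint64_t)t;
    r[2] += mask + (uint64_t)(t >> 64);         /* sign extension limb plus carry */
}
```

The point is that <mask, hi ^ mask, lo ^ mask> plus a carry-in of
(mask & 1) is exactly the 192-bit two's complement of the product when
the mask is set.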
  
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From marc.glisse at inria.fr  Wed Feb 22 19:18:58 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Wed, 22 Feb 2012 19:18:58 +0100 (CET)
Subject: _mp_alloc vs ALLOC
Message-ID: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>

Hello,

is there any objection if I replace most uses of ->_mp_alloc by calls to 
the ALLOC macro in mp[zqf] (and similarly for _mp_size, etc)? It helps 
when experimenting... I am also considering moving the NUM and DEN macros 
from test/mpq/t-cmp* to gmp-impl.h, since I assume mpq_numref and 
mpq_denref are not used much internally because of their length. By the 
way, is there any difference between PTR and LIMBS? Say one that should be 
used in some circumstances and one in others?

Unrelated, I was thinking of changing (when gmp is compiled with a C++ 
compiler, so that wouldn't affect many people...) the definitions of 
TMP_DECL and TMP_FREE so TMP_DECL would create a variable whose destructor 
(executed when the variable goes out of scope, which shouldn't be far from 
where TMP_FREE is currently called) does what TMP_FREE currently does. The 
advantage is that in case an exception is thrown in between, the 
destructor is executed. That doesn't solve all memory issues by far, but 
it is a first step that costs little in terms of code and 0 in speed.

I am not saying I will do either any time soon, just checking first.

-- 
Marc Glisse

From tg at gmplib.org  Wed Feb 22 19:41:17 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 19:41:17 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	(Marc Glisse's message of "Wed\,
	22 Feb 2012 19\:18\:58 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
Message-ID: <86d396soc2.fsf@shell.gmplib.org>

Marc Glisse <marc.glisse at inria.fr> writes:

  is there any objection if I replace most uses of ->_mp_alloc by calls
  to the ALLOC macro in mp[zqf] (and similarly for _mp_size, etc)? It
  helps when experimenting... I am also considering moving the NUM and
  DEN macros from test/mpq/t-cmp* to gmp-impl.h, since I assume
  mpq_numref and mpq_denref are not used much internally because of
  their length. By the way, is there any difference between PTR and
  LIMBS? Say one that should be used in some circumstances and one in
  others?
  
You're welcome to clean this up.  The macro LIMBS is used in just one
file, AFAICT; I have no idea why it exists.

  Unrelated, I was thinking of changing (when gmp is compiled with a C++
  compiler, so that wouldn't affect many people...) the definitions of
  TMP_DECL and TMP_FREE so TMP_DECL would create a variable whose
  destructor (executed when the variable goes out of scope, which
  shouldn't be far from where TMP_FREE is currently called) does what
  TMP_FREE currently does. The advantage is that in case an exception is
  thrown in between, the destructor is executed. That doesn't solve all
  memory issues by far, but it is a first step that costs little in
  terms of code and 0 in speed.
  
That'd be fine too.

-- 
Torbjörn

From bodrato at mail.dm.unipi.it  Wed Feb 22 20:28:52 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Wed, 22 Feb 2012 20:28:52 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86d396soc2.fsf@shell.gmplib.org>
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
Message-ID: <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>

Ciao,

Il Mer, 22 Febbraio 2012 7:41 pm, Torbjorn Granlund ha scritto:
> Marc Glisse <marc.glisse at inria.fr> writes:

>   their length. By the way, is there any difference between PTR and
>   LIMBS? Say one that should be used in some circumstances and one in
>   others?
>
> You're welcome to clean this up.  The macro LIMBS is used in just one
> file, AFAICT; I have no idea why it exists.

I suspect there are other unused macros hanging in gmp-impl.h ...

>   Unrelated, I was thinking of changing (when gmp is compiled with a C++
...
>   TMP_DECL and TMP_FREE so TMP_DECL would create a variable whose

Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I mean
_3 and _4. So that they can be used to reduce the number of allocations.
Do you agree? (I just touched mpz/gcdext.c, and _4 should be used there).

Regards,
m

-- 
http://bodrato.it/


From tg at gmplib.org  Wed Feb 22 20:32:18 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 20:32:18 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	(bodrato@mail.dm.unipi.it's message of "Wed\,
	22 Feb 2012 20\:28\:52 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
Message-ID: <86zkcar7el.fsf@shell.gmplib.org>

bodrato at mail.dm.unipi.it writes:

  Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I mean
  _3 and _4. So that they can be used to reduce the number of allocations.
  Do you agree? (I just touched mpz/gcdext.c, and _4 should be used there).
  
I'd vote for killing TMP_ALLOC_LIMBS_2 rather than add TMP_ALLOC_LIMBS_N
for some range of N.

Please look at the generated code from TMP_ALLOC from any reasonable
compiler.  It is a sub from sp, then a copy from sp to the target
variable.  Cost: about 1 cycle.

TMP_ALLOC_LIMBS_2 is clutter IMHO.

-- 
Torbjörn

From nisse at lysator.liu.se  Wed Feb 22 20:57:31 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 20:57:31 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86zkcar7el.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Wed, 22 Feb 2012 20:32:18 +0100")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
Message-ID: <nnpqd6tzdg.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> TMP_ALLOC_LIMBS_2 is clutter IMHO.

Sure, it's pointless in a normal build.

As I understand it, the reason for having TMP_ALLOC_LIMBS_2 is to make
--enable-alloca=debug more effective, by getting some kind of red zone
separating the two areas. Whether or not that's worth the clutter, I'm
not sure.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Wed Feb 22 21:02:55 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 21:02:55 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <nnpqd6tzdg.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 20\:57\:31
	+0100")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
Message-ID: <86pqd6r5zk.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:
  
  > TMP_ALLOC_LIMBS_2 is clutter IMHO.
  
  Sure, it's pointless in a normal build.
  
  As I understand it, the reason for having TMP_ALLOC_LIMBS_2 is to make
  --enable-alloca=debug more effective, by getting some kind of red zone
  separating the two areas. Whether or not that's worth the clutter, I'm
  not sure.
  
Surely a plain TMP_ALLOC adds red zones?  If not, that is something we
ought to fix.

-- 
Torbjörn

From marc.glisse at inria.fr  Wed Feb 22 21:20:23 2012
From: marc.glisse at inria.fr (Marc Glisse)
Date: Wed, 22 Feb 2012 21:20:23 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
Message-ID: <alpine.DEB.2.02.1202222110040.2582@laptop-mg.saclay.inria.fr>

On Wed, 22 Feb 2012, Torbjorn Granlund wrote:

> bodrato at mail.dm.unipi.it writes:
>
>  Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I mean
>  _3 and _4. So that they can be used to reduce the number of allocations.
>  Do you agree? (I just touched mpz/gcdext.c, and _4 should be used there).
>
> I'd vote for killing TMP_ALLOC_LIMBS_2 rather than add TMP_ALLOC_LIMBS_N
> for some range of N.
>
> Please look at the generated code from TMP_ALLOC from any reasonable
> compiler.  It is a sub from sp, then a copy from sp to the target
> variable.  Cost: about 1 cycle.

That's for the alloca case. Without alloca, one call to malloc is better 
than two (although that usually also means the numbers are big and any gmp 
operation will dwarf allocation). Also, the threshold between alloca and 
malloc is quite high, and with many separate allocations that all barely 
fit below this threshold, the total amount of stack memory used can become 
too large for some applications (lowering the threshold may be easier than 
allocating things in groups though).


On Wed, 22 Feb 2012, Niels Möller wrote:

> Torbjorn Granlund <tg at gmplib.org> writes:
>
>> TMP_ALLOC_LIMBS_2 is clutter IMHO.
>
> Sure, it's pointless in a normal build.
>
> As I understand it, the reason for having TMP_ALLOC_LIMBS_2 is to make
> --enable-alloca=debug more effective, by getting some kind of red zone
> separating the two areas. Whether or not that's worth the clutter, I'm
> not sure.

Er, I guess you mean TMP_ALLOC_LIMBS_2 as opposed to a single call to 
TMP_ALLOC_LIMBS manually split in two, not as opposed to 2 calls to 
TMP_ALLOC_LIMBS.

-- 
Marc Glisse

From tg at gmplib.org  Wed Feb 22 21:24:06 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Wed, 22 Feb 2012 21:24:06 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <alpine.DEB.2.02.1202222110040.2582@laptop-mg.saclay.inria.fr>
	(Marc Glisse's message of "Wed\,
	22 Feb 2012 21\:20\:23 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<alpine.DEB.2.02.1202222110040.2582@laptop-mg.saclay.inria.fr>
Message-ID: <86linur509.fsf@shell.gmplib.org>

Marc Glisse <marc.glisse at inria.fr> writes:

  That's for the alloca case. Without alloca, one call to malloc is
  better than two (although that usually also means the numbers are big
  and any gmp operation will dwarf allocation). Also, the threshold
  between alloca and malloc is quite high, and with many separate
  allocations that all barely fit below this threshold, the total amount
  of stack memory used can become too large for some applications
  (lowering the threshold may be easier than allocating things in groups
  though).
  
I don't buy this argument.

If the threshold is high, then surely the malloc time will not take a
significant fraction of the total time.

If the threshold is too high, then we should lower it.

Is there no good range for the threshold?  Show me the numbers...!

-- 
Torbjörn

From nisse at lysator.liu.se  Wed Feb 22 22:36:39 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Wed, 22 Feb 2012 22:36:39 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86sji2khqo.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Wed, 22 Feb 2012 16:30:55 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
Message-ID: <nnfwe2tus8.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> The sandybridge machine tom behind shell is waiting for you.  A nehalem
> machine is biko* (lots of Xen machines on the same hardware) and
> repentium is a Conroe ("core2").

The best I find is

        mov     (up, n, 8), %rax
        mul     v
        mov     %rdx, c1
        add     (rp, n, 8), %rax
        adc     $0, c1
        add     %rax, c0
        adc     $0, c1
        mov     c0, (rp, n, 8)

Unrolled four times, that's 34 instructions. The best result from the
loop mixer so far has been 3.24 cycles / limb. See
shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms. Which is the same as
your code, I guess.

A variant with one more instruction, to move away %rax, is at
shell:~nisse/hack/loopmix/lms/addmul_1.nlms; it seems to run at 3.52.

A variant with 7 instructions, but two "add reg, mem" operations, is
also slow.
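
For anyone checking candidate loops against a reference, here is a C
model of what one pass of the 8-instruction loop computes.  It is a
sketch assuming unsigned __int128 (GCC or Clang); the name ref_addmul_1
is made up and this is not the GMP source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Reference model: rp[0..n-1] += up[0..n-1] * v, returning the carry-out
   limb.  c0 is the carry limb coming in, c1 the one going out, as in
   the assembly loop. */
static uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n,
                             uint64_t v)
{
    uint64_t c0 = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        u128 p = (u128)up[i] * v;            /* mul v */
        uint64_t lo = (uint64_t)p;
        uint64_t c1 = (uint64_t)(p >> 64);   /* mov %rdx, c1 */
        lo += rp[i];  c1 += (lo < rp[i]);    /* add (rp), %rax; adc $0, c1 */
        c0 += lo;     c1 += (c0 < lo);       /* add %rax, c0;   adc $0, c1 */
        rp[i] = c0;                          /* mov c0, (rp) */
        c0 = c1;                             /* becomes the next carry-in */
    }
    return c0;
}
```

The two adc $0 steps cannot overflow c1, since up[i]*v + rp[i] + c0 is at
most 2^128 - 1.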

Regards,
/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 23 07:51:26 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 07:51:26 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nnfwe2tus8.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Wed\, 22 Feb 2012 22\:36\:39
	+0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
Message-ID: <86fwe2qbyp.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  The best I find is
  
          mov     (up, n, 8), %rax
          mul     v
          mov     %rdx, c1
          add     (rp, n, 8), %rax
          adc     $0, c1
          add     %rax, c0
          adc     $0, c1
          mov     c0, (rp, n, 8)
  
  Unrolled four times, that's 34 instructions. The best result from the
  loop mixer so far has been 3.24 cycles / limb. See
  shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms. Which is the same as
  your code, I guess.
  
My code looked like 3.16 in the loopmixer, then runs at 3.25 outside of
it.  If you can make your smaller code actually run at 3.25, it is an
improvement.

I think we should focus not on addmul_1 but on mul_basecase,
sqr_basecase, redc_1, perhaps redc_2.  I.e., please focus on addmul_2
(or addmul_N, N > 2) or vertical multiplication primitives.

-- 
Torbjörn

From nisse at lysator.liu.se  Thu Feb 23 09:53:54 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 23 Feb 2012 09:53:54 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86fwe2qbyp.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu, 23 Feb 2012 07:51:26 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
Message-ID: <nn39a2szfh.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> I think we should focus not on addmul_1 but on mul_basecase,
> sqr_basecase, redc_1, perhaps redc_2.  I.e., please focus on addmul_2
> (or addmul_N, N > 2) or vertical multiplication primitives.

Here's a sketch of an addmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.

	mov	(up), %rax
        mov	%rax, ul
        mul	vl		C low product
        mov	8(up), uh
        mov	%rax, r0
        mov	%rdx, r1
        lea	(uh, ul), %rax
        sub	uh, ul
        cmovnc	ul, %rax
        sbb	r3, r3
        mul	vd		C Middle product
        mov	r1, r2		C Add shifted low product
        add	r0, r1
        adc	$0, r2
        add	(rp), r0	C Add rp limbs
        adc	8(rp), r1
        adc	$0, r2		
        mov	%rax, p0
        mov	%rdx, p1
        mov	uh, %rax
        mul	vh		C High product
        xor	vmask, r3
        xor	r3, p0		C Conditionally negate, and add, middle product
        xor	r3, p1
        bt	$0, r3
        adc	p0, r1
        adc	p1, r2
        adc	$0, r3
        add	%rax, r1	C Add shifted high product
        adc	%rdx, r2
        adc	$0, r2
        add	c0, r0		C Add input carry limbs
        adc	c1, r1
        mov	c0, (rp)
        mov	c1, 8(rp)
        adc	%rax, r2	C Add high product
        adc	%rdx, r3

37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).

I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.

* At least the moves to p0, p1 can be eliminated.

* One could also save some instructions by adding in c0, c1 earlier and
  doing an in-place add to (rp) at the end, on the theory that the
  recurrency is less tight.

* I'm also not sure if the order of the three multiplications is the
  best one.
  
* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
  (if additions are organized in the right way) is done twice.  I suspect
  it's going to be a bit painful since the carry has to be applied in
  two places.

What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.

For a "vertical" mul_basecase, the quadratic work would be an iteration
of

	mov	(up), %rax
        mul	(vp)
        add	%rax, r0
        adc	%rdx, r1
        adc	$0, r2

So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carry
propagation, and possibly bad branch prediction due to loops of varying
lengths.
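
To make the vertical scheme concrete, here is a C model of a column-wise
basecase multiply.  It is a sketch under assumptions: unsigned __int128
is a GCC/Clang extension, and the name vertical_mul is made up for
illustration, it is not an existing GMP function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Hypothetical column-wise multiply:
   rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
   Each column is summed into three limbs <r2,r1,r0>; nothing is
   propagated further left until the column is finished. */
static void vertical_mul(uint64_t *rp, const uint64_t *up, size_t un,
                         const uint64_t *vp, size_t vn)
{
    uint64_t r0 = 0, r1 = 0, r2 = 0;
    assert(un > 0 && vn > 0);
    for (size_t k = 0; k + 1 < un + vn; k++) {
        size_t lo = (k >= vn) ? k - vn + 1 : 0;   /* first u index in column k */
        size_t hi = (k < un) ? k : un - 1;        /* last u index in column k */
        for (size_t i = lo; i <= hi; i++) {
            u128 p = (u128)up[i] * vp[k - i];
            u128 t = (u128)r0 + (uint64_t)p;      /* low half into r0 */
            r0 = (uint64_t)t;
            t = (u128)r1 + (uint64_t)(p >> 64) + (uint64_t)(t >> 64);
            r1 = (uint64_t)t;                     /* high half plus carry into r1 */
            r2 += (uint64_t)(t >> 64);            /* remaining carry into r2 */
        }
        rp[k] = r0;                               /* column done: emit one limb */
        r0 = r1; r1 = r2; r2 = 0;                 /* shift accumulator right */
    }
    rp[un + vn - 1] = r0;
}
```

The inner loop is the quadratic work; the per-column store and shift, and
the varying trip count of the inner loop, are the linear cost and the
branch-prediction issue mentioned above.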

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 23 11:13:30 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 11:13:30 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nn39a2szfh.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Thu\, 23 Feb 2012 09\:53\:54
	+0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
Message-ID: <86d395kgc5.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Here's a sketch of an addmul_2 iteration using Karatsuba. I assume we
  have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
  before the loop. Carry input in c0, c1, carry out in r2, r3.
  
          mov	(up), %rax
          mov	%rax, ul
          mul	vl		C low product
          mov	8(up), uh
          mov	%rax, r0
          mov	%rdx, r1
          lea	(uh, ul), %rax
          sub	uh, ul
          cmovnc	ul, %rax
          sbb	r3, r3
          mul	vd		C Middle product
          mov	r1, r2		C Add shifted low product
          add	r0, r1
          adc	$0, r2
          add	(rp), r0	C Add rp limbs
          adc	8(rp), r1
          adc	$0, r2		
          mov	%rax, p0
          mov	%rdx, p1
          mov	uh, %rax
          mul	vh		C High product
          xor	vmask, r3
          xor	r3, p0		C Conditionally negate, and add, middle product
          xor	r3, p1
          bt	$0, r3
          adc	p0, r1
          adc	p1, r2
          adc	$0, r3
          add	%rax, r1	C Add shifted high product
          adc	%rdx, r2
          adc	$0, r2
          add	c0, r0		C Add input carry limbs
          adc	c1, r1
          mov	c0, (rp)
          mov	c1, 8(rp)
          adc	%rax, r2	C Add high product
          adc	%rdx, r3
  
  37 instructions, or 12.25 instructions per limb, excluding looping logic
  (and it has to be unrolled twice, to use separate registers for input
  and output carries).
  
How did you arrive at 12.25 insns/limb?  I have not tried to understand
the code, but doesn't it perform a 2x2 limb multiply with accumulation?
That's 9.25 insn/limb product.

I very much doubt this will win for Sandybridge, unless you can decrease
the insn count by several instructions.  Unfortunately it has no
chance on Bulldozer, since the latter has a 2-issue pipeline; you need
to beat 32 insns per 2x2 accumulation block there.

  I think the instruction count can be reduced a bit, at the cost of
  higher pressure on the out-of-order execution.
  
Perhaps some of the adc $0 could be eliminated with 2x unrolling?
  
  What do you think? If one can get one iteration to run at 12 cycles,
  that's 3 c/l and an improvement over addmul_1. If one can get it down to
  11 or 11.5, one beats 3 c/l.
  
If that is possible, it might not be enough...  See below.

  For a "vertical" mul_basecase, the quadratic work would be an iteration
  of
  
  	mov	(up), %rax
          mul	(vp)
          add	%rax, r0
          adc	%rdx, r1
          adc	$0, r2
  
  So there's potential for that to run at 2 cycles per limb product. But
  then there's also a significant linear cost for accumulation and carry
  propagation, and possibly bad branch prediction due to loops of varying
  lengths.
  
Exactly.  (But the branch misprediction problem would not happen for
David's mulmid_basecase, I suppose.)

Some ways to deal with the branch misprediction problem:

* Have straight line code for the corners, up to a limit.  This gets rid
  of the really high relative branch misprediction for these small
  areas.

* Handle two (or more) columns in parallel, and separately for the
  low-significant right triangle, any middle rectangular part, and the
  left triangle.  This doubles (or more) the amount of useful work per
  branch misprediction.

* I suppose that making full use of out-of-order execution just before a
  mispredicted branch would make sense.

I played a bit with mul_2 yesternight.  I am not 100% sure the code is
correct, but I think it is.  The loopmixer found a 2.5 c/l version of
it.

I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul".  I then analysed the critical path and determined that
it is about 24.  The problem is adc feeding other adc feeding other adc
through a register (remember that pure carry deps are fast on
Sandybridge).  So I mindlessly introduced 4 new registers, then used
alternating accumulation registers.  I needed 8 extra insns (in total,
corresponding to 2 per way) to pairwise sum accumulation registers.

I am sure this was not done optimally, but (assuming my code is sound)
it proves that there is a lot of performance headroom, as expected.

I conjecture that we could create an addmul_N for Sandybridge that runs
at <= 2.5 c/l.  I think this will be possible already for N=2.  Perhaps
we could arrive at 2.25 for N=4, matching the K8-K10 performance.

-- 
Torbjörn

From nisse at lysator.liu.se  Thu Feb 23 16:09:56 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 23 Feb 2012 16:09:56 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <86d395kgc5.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu, 23 Feb 2012 11:13:30 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
Message-ID: <nny5rtsi0r.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> How did you arrive at 12.25 insns/limb?

I did something wrong, I guess... 9.25 instructions / limb or 3.1 cycles
if we can issue 3 instructions per cycle.

> I very much doubt this will win for Sandybridge, unless you can decrease
> the insn count by several instructions.

One can decrease it a bit by adding c0, c1 earlier (do you think
recurrency can be a problem if we add c0, c1 to the first product?) and
doing an in-place add to (rp) and 8(rp) at the end.

I could get it down to 30 instructions with a deep carry recurrency, or
34 with a short one. I can get neither variant to run faster than 4 c/l.

I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).
It's about the same number of instructions, but the updates
from carries are independent of all products, so there's more
freedom in where to move them around.

I think this idea may be more useful for other processors, without the
awkward hardwired mul registers. For documentation, this is what the
iteration should do:

	vd = |v1 - v0|, sign in vs (outside loop)
	
	ud = |u1 - u0|, sign in us
	s = us ^ vs ^ 1
	<p3, p2> = u1 * v1
	<m1, m0> = ud * vd ^ <-s, -s>
	<p1, p0> = u0 * v0

        +-----+-----+
	|p3 p2|p1 p0|
	+-----+--+--+
	      |c1 c0|
	      +-----+
	      |r1 r0|
	   +--+--+--+
	   |p1 p0|
	   +-----+
	   |p3 p2|
	+--+--+--+
	|-s| 0| s|
	+--+--+--+
	   |m1 m0|
      --+--+-----+--+
        |c3 c2 r1 r0|
        +-----------+

or

	<vc, vs> = v1 + v0 (outside loop)
	
	<uc, us> = u1 + u0
	<p3, p2> = u1 * v1
	<m1, m0> = us * vs
	<p1, p0> = u0 * v0
	
        +-----+-----+
	|p3 p2|p1 p0|
	+-----+--+--+
	      |c1 c0|
	      +-----+
	      |r1 r0|
	   +--+--+--+
	-  |p1 p0|
	   +-----+
	-  |p3 p2|
	   +-----+
	   |m1 m0|
	+--+--+--+
 	|vc vs|    if uc
	+--+--+
	   |us|    if vc
      --+--+--+-----+
        |c3 c2 r1 r0|
        +-----------+
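
As a cross-check of the first (difference-based) diagram, here is a C
model of the plain 2x2 product using three multiplications, leaving out
the c1 c0 and r1 r0 addends.  It is a sketch only: it assumes unsigned
__int128 (GCC or Clang), uses ordinary branches for readability rather
than masks, and the name kara_mul_2x2 is made up.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Hypothetical helper: r[0..3] = <u[1],u[0]> * <v[1],v[0]> using
   u0*v1 + u1*v0 = u0*v0 + u1*v1 - (u1 - u0)*(v1 - v0). */
static void kara_mul_2x2(uint64_t r[4], const uint64_t u[2],
                         const uint64_t v[2])
{
    uint64_t us = u[1] < u[0], vs = v[1] < v[0];    /* signs of u1-u0, v1-v0 */
    uint64_t ud = us ? u[0] - u[1] : u[1] - u[0];   /* |u1 - u0| */
    uint64_t vd = vs ? v[0] - v[1] : v[1] - v[0];   /* |v1 - v0| */
    u128 p0 = (u128)u[0] * v[0];                    /* <p1,p0> */
    u128 p2 = (u128)u[1] * v[1];                    /* <p3,p2> */
    u128 m  = (u128)ud * vd;                        /* |middle term| */

    /* mid = p0 + p2 -/+ m, held as c * 2^128 + s with c in {0, 1} */
    u128 s = p0 + p2;
    uint64_t c = s < p0;
    if (us != vs) {
        s += m;  c += s < m;      /* signs differ: the term is added */
    } else {
        c -= s < m;  s -= m;      /* signs agree: the term is subtracted */
    }
    assert(c <= 1);

    /* r = p0 + mid * 2^64 + p2 * 2^128 */
    u128 t = (p0 >> 64) + (uint64_t)s;
    r[0] = (uint64_t)p0;
    r[1] = (uint64_t)t;
    t = (t >> 64) + (uint64_t)(s >> 64) + (uint64_t)p2;
    r[2] = (uint64_t)t;
    r[3] = (uint64_t)(t >> 64) + c + (uint64_t)(p2 >> 64);
}
```

The extra bit c is the reason the middle sum needs a third limb in the
diagrams.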

> Perhaps some of the adc $0 could be eliminated with 2x unrolling?

In effect, that would be a kind of 4x2 multiply. Which would then be
done as two 2x2 (I think the high limbs one gets from evaluation rule
out using toom32 or toom42).

Haven't tried that. I suspect one will run out of registers.

> I played a bit with mul_2 yesternight.  I am not 100% sure the code is
> correct, but I think it is.  The loopmixer found a 2.5 c/l version of
> it.

Nice. I've now wasted quite some time... It seems really
difficult.

Now I also tried a very basic variant of addmul_2, doing only one u limb
per iteration and multiplying it by the two v limbs. It has a very nice
carry recurrence between iterations (add, adc, adc $0: four cycles) and
a small number of instructions (15 per iteration, 32 for the twice
unrolled loop), which one might think could be executed in 11 cycles,
i.e., 5.5 per iteration or 2.75 cycles per limb product. But it won't
run faster than 6.5 cycles per iteration, or 3.25 c/l.

So it just seems very difficult to convince the cpu to really execute
the independent operations, outside of the recurrency, in parallel.
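
For reference, a C model of that basic variant, with one u limb per
iteration and a two-limb carry between iterations.  This is a sketch
under assumptions: unsigned __int128 is a GCC/Clang extension, and the
name ref_addmul_2 and its carry-out convention are made up here and do
not claim to match GMP's mpn_addmul_2 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Hypothetical reference: rp[0..n-1] receives the n low limbs of
   rp + up[0..n-1] * <v1,v0>; the two remaining high limbs are
   returned in carry[0] (low) and carry[1] (high). */
static void ref_addmul_2(uint64_t *rp, const uint64_t *up, size_t n,
                         uint64_t v0, uint64_t v1, uint64_t carry[2])
{
    uint64_t c0 = 0, c1 = 0;      /* two-limb carry recurrence */
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        u128 lo = (u128)up[i] * v0;
        u128 hi = (u128)up[i] * v1;
        /* limb 0: rp[i] + c0 + low(u*v0) */
        u128 t = (u128)rp[i] + c0 + (uint64_t)lo;
        rp[i] = (uint64_t)t;
        /* limb 1: c1 + high(u*v0) + low(u*v1) + carry */
        t = (t >> 64) + c1 + (uint64_t)(lo >> 64) + (uint64_t)hi;
        c0 = (uint64_t)t;
        /* limb 2: high(u*v1) + carry */
        c1 = (uint64_t)(hi >> 64) + (uint64_t)(t >> 64);
    }
    carry[0] = c0;
    carry[1] = c1;
}
```

Only c0 and c1 cross iterations; both products and the rp load are
independent of the recurrence.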

BTW, are any of the SSE3 etc instructions useful here?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 23 17:44:21 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 17:44:21 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nny5rtsi0r.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Thu\, 23 Feb 2012 16\:09\:56
	+0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
	<nny5rtsi0r.fsf@stalhein.lysator.liu.se>
Message-ID: <868vjtijoa.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  One can decrease it a bit by adding c0, c1 earlier (do you think
  recurrency can be a problem if we add c0, c1 to the first product?) and
  doing an in-place add to (rp) and 8(rp) at the end.
  
  I could get it down to 30 instructions with a deep carry recurrency, or
  34 with a short one. I can get neither variant to run faster than 4 c/l.
  
In loopmixer or manually?  I wouldn't draw any conclusions without
mixing the code first...

  I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).

Meaning evaluating in +1 instead of -1, I assume.

  It's about the same number of instructions, but the updates from
  carries are independent of all products, so there's more freedom
  in where to move them around.

  I think this idea may be more useful for other processors, without the
  awkward hardwired mul registers.

True.
  
  > I played a bit with mul_2 yesternight.  I am not 100% sure the code is
  > correct, but I think it is.  The loopmixer found a 2.5 c/l version of
  > it.
  
  Nice. I've now wasted quite some time... It seems really difficult.

It is challenging, but I am getting convinced we can really speed things
a lot.
  
  Now I also tried a very basic variant of addmul_2, doing only one u limb
  per iteration and multiplying it by the two v limbs. It has a very nice
  carry recurrence between iterations (add, adc, adc $0: four cycles) and
  a small number of instructions (15 per iteration, 32 for the twice
  unrolled loop), which one might think could be executed in 11 cycles,
  i.e., 5.5 per iteration or 2.75 cycles per limb product. But it won't
  run faster than 6.5 cycles per iteration, or 3.25 c/l.
  
  So it just seems very difficult to convince the cpu to really execute
  the independent operations, outside of the recurrency, in parallel.
  
Did you compute the recurrency chain?  Annotating the instructions on
the recurrency chain helps understanding the problem.

My experience of Sandybridge is that with load/store coding style, the
CPU typically executes 3 insn/cycle unless there is a recurrency
dependency stopping that.

  BTW, are any of the SSE3 etc instructions useful here?
  
I don't think there are.  These instructions are mostly FP plus narrow
integer ops.  

-- 
Torbjörn

From nisse at lysator.liu.se  Thu Feb 23 21:00:41 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 23 Feb 2012 21:00:41 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <868vjtijoa.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu, 23 Feb 2012 17:44:21 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
	<nny5rtsi0r.fsf@stalhein.lysator.liu.se>
	<868vjtijoa.fsf@shell.gmplib.org>
Message-ID: <nnty2hs4k6.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> In loopmixer or manually?  I wouldn't draw any conclusions without
> mixing the code first...

With the loop mixer.

> Meaning evaluating in +1 instead of -1, I assume.

Exactly.

> Did you compute the recurrency chain?  Annotating the instructions on
> the recurrency chain helps understanding the problem.

I tried. I use this iteration 

        mov     (up, n, 8), %rax
        mov     %rax, u
        mul     v0
        mov     %rax, r0
        mov     %rdx, r1
        mov     u, %rax
        mul     v1
        mov     (rp, n, 8), t 
        add     t, r0  
        adc     %rax, r1
        mov     %rdx, r2
        adc     $0, r2
        add     c0, r0
        mov     r0, (rp, n, 8)
        adc     c1, r1
        adc     $0, r2

For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
different registers),

        add     c0, r0          C  0  6
        adc     c1, r1          C  1  7
        adc     $0, r2          C  3  9

        add     r1, c2          C  3  9
        adc     r2, c0          C  4  10
        adc     $0, c1          C  6  12

So the recurrency, for one iteration, seems to be just 3 cycles. But the
loop mixer doesn't find anything faster than 6.36 cycles for one
iteration, or 3.18 per limb product. Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier), but
stubbornly above 3 c/l.
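
As a cross-check of the arithmetic, here is a portable C model of what one
such iteration is meant to compute, with c0, c1 as the carry limbs passed
between iterations.  This is a sketch under assumptions, not the loop
itself: the names addmul_2_step and addmul_2_ref are hypothetical, limbs
are taken to be 64 bits, and unsigned __int128 is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;
typedef unsigned __int128 dlimb_t;   /* GCC/Clang extension */

/* One iteration: u*(v0 + v1*B) + *rp_i + c0 + c1*B, with B = 2^64.
   The low limb goes back to *rp_i; the two high limbs become the new
   c0, c1.  The total is below B^3, so three limbs are enough. */
void addmul_2_step(limb_t *rp_i, limb_t u, limb_t v0, limb_t v1,
                   limb_t *c0, limb_t *c1)
{
  dlimb_t p0 = (dlimb_t) u * v0;
  dlimb_t p1 = (dlimb_t) u * v1;
  dlimb_t lo = (dlimb_t) (limb_t) p0 + *rp_i + *c0;
  dlimb_t mid = (p0 >> 64) + (limb_t) p1 + *c1 + (lo >> 64);
  limb_t hi = (limb_t) (p1 >> 64) + (limb_t) (mid >> 64);

  *rp_i = (limb_t) lo;
  *c0 = (limb_t) mid;
  *c1 = hi;
}

/* {rp, n} += {up, n} * (v0 + v1*B); the two limbs that do not fit in
   {rp, n} are returned in carry[0] (low) and carry[1] (high). */
void addmul_2_ref(limb_t *rp, const limb_t *up, size_t n,
                  limb_t v0, limb_t v1, limb_t carry[2])
{
  limb_t c0 = 0, c1 = 0;
  for (size_t i = 0; i < n; i++)
    addmul_2_step(&rp[i], up[i], v0, v1, &c0, &c1);
  carry[0] = c0;
  carry[1] = c1;
}
```

The only state carried from one call of addmul_2_step to the next is the
pair (c0, c1), which is the recurrency discussed above.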

> My experience of Sandybridge is that with load/store coding style, the
> CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.

If we could get there, the above loop should run just below 3 c/l.

> I don't think there are.  These instructions are mostly FP plus narrow
> integer ops.  

Hmm. Last time I looked at that was in a 32-bit context. There's a
32x32->64 instruction which might be useful for a 32-bit build, at least
in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Thu Feb 23 21:09:32 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Thu, 23 Feb 2012 21:09:32 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nnty2hs4k6.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?=
	message of "Thu, 23 Feb 2012 21:00:41 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
	<nny5rtsi0r.fsf@stalhein.lysator.liu.se>
	<868vjtijoa.fsf@shell.gmplib.org>
	<nnty2hs4k6.fsf@stalhein.lysator.liu.se>
Message-ID: <nnlints45f.fsf@stalhein.lysator.liu.se>

nisse at lysator.liu.se (Niels Möller) writes:

> So the recurrency, for one iteration, seems to be just 3 cycles. But the
> loop mixer doesn't find anything faster than 6.36 cycles for one
> iteration, or 3.18 per limb product. Which isn't too bad (a slight
> improvement over 3.24, which I think is the best reported earlier), but
> stubbornly above 3 c/l.

One update. I have now tried unrolling four times. Then I've seen one
sequence running at 6.16 cycles per iteration, or 3.08 c/l.

See shell:~nisse/hack/loopmix/lms/addmul_2-nisse-2.nlms.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Thu Feb 23 21:38:27 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 21:38:27 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <nnty2hs4k6.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Thu\, 23 Feb 2012 21\:00\:41
	+0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
	<nny5rtsi0r.fsf@stalhein.lysator.liu.se>
	<868vjtijoa.fsf@shell.gmplib.org>
	<nnty2hs4k6.fsf@stalhein.lysator.liu.se>
Message-ID: <864nuhp9oc.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
  Let's write the interesting instructions out and unroll twice (using
  different registers),
  
          add     c0, r0		C  0  6
          adc     c1, r1		C  1  7
          adc     $0, r2		C  3  9
  
          add     r1, c2		C  3  9
          adc     r2, c0		C  4  10
          adc     $0, c1		C  6  12
  
  So the recurrency, for one iteration, seems to be just 3 cycles. But the
  loop mixer doesn't find anything faster than 6.36 cycles for one
  iteration, or 3.18 per limb product. Which isn't too bad (a slight
  improvement over 3.24, which I think is the best reported earlier), but
  stubbornly above 3 c/l.
  
I am playing with this block:

carry-in lo in r14
carry-in hi in rcx
        mov     0(up), %rax
        mul     v1
        mov     8(rp), %r8
        add     %rax, %r8
        mov     %rdx, %r9
        adc     $0, %r9
        mov     8(up), %rax
        mul     v0
        add     %rax, %r8
        adc     %rdx, %r9
        mov     $0, R32(%rbx)
        adc     $0, R32(%rbx)
        add     %r14, %r8               C 0
        adc     %rcx, %r9               C 1
        adc     $0, R32(%rbx)           C might be removed
        mov     %r8, 8(rp)
carry-out lo in r9
carry-out hi in rbx

This is not identical to your block, I think.  It runs at exactly 3 c/l.
The recurrency path is extremely shallow, at 1.5 c/l.

If we slightly restrict the operand range, we could remove the indicated
carry propagation insn.  Then the code runs at 2.8 c/l.

Neither is decoding bandwidth limited.

It is further possible to supplant the 'mov $0,reg' and following 'adc
$0,reg' with 'setc reg'.  This creates a false dependency (on the upper
56 bits) and seems to run at about the same speed.

The plain code (i.e., the code which runs at 3.0 c/l) runs at 3-epsilon
if the lea pointer update insns are removed.  This is a good sign,
proving there is no magic stopping us at 3 c/l...

  > My experience of Sandybridge is that with load/store coding style, the
  > CPU typically executes 3 insn/cycle unless there is a recurrency
  > dependency stopping that.
  
  If we could get there, the above loop should run just below 3 c/l.
  
I was obviously wrong.  :-(

  Hmm. Last time I looked at that was in a 32-bit context. There's a
  32x32->64 instruction which might be useful for a 32-bit build, at least
  in theory, but as far as I can find in the manual, the latest
  ss*-extensions don't provide any wider multiplication than that.

I believe that insn is used for 32-bit builds, where it helps.  Many
improvements could be made for 32-bit builds, if one cared.

(I see a new mail has arrived, will read now.)

-- 
Torbjörn

From tg at gmplib.org  Thu Feb 23 22:30:51 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Thu, 23 Feb 2012 22:30:51 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <864nuhp9oc.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu\, 23 Feb 2012 21\:38\:27 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
	<nny5rtsi0r.fsf@stalhein.lysator.liu.se>
	<868vjtijoa.fsf@shell.gmplib.org>
	<nnty2hs4k6.fsf@stalhein.lysator.liu.se>
	<864nuhp9oc.fsf@shell.gmplib.org>
Message-ID: <86y5rtnsok.fsf@shell.gmplib.org>

Torbjorn Granlund <tg at gmplib.org> writes:

  If we slightly restrict the operand range, we could remove the indicated
  carry propagation insn.
  
Wrong.  Those carry propagation insns are needed.

-- 
Torbjörn

From bodrato at mail.dm.unipi.it  Fri Feb 24 09:21:46 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 09:21:46 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86zkcar7el.fsf@shell.gmplib.org>
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
Message-ID: <49731.151.32.246.62.1330071706.squirrel@mail.dm.unipi.it>


Il Mer, 22 Febbraio 2012 8:32 pm, Torbjorn Granlund ha scritto:
> bodrato at mail.dm.unipi.it writes:
>   Unrelated :-) We might define more macros like TMP_ALLOC_LIMBS_2 . I

> Please look at the generated code from TMP_ALLOC from any reasonable
> compiler.  It is a sub from sp, then a copy from sp to the target
> variable.  Cost: about 1 cycle.

	sal	$3, %n
	cmpl	$65535, %n
	ja	.Lunlikelybranch
	add	$30, %n
	and	$-16, %n
	sub	%n, %esp

.Lunlikelybranchreturnshere



> TMP_ALLOC_LIMBS_2 is clutter IMHO.


-- 
http://bodrato.it/


From bodrato at mail.dm.unipi.it  Fri Feb 24 09:23:36 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 09:23:36 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <49731.151.32.246.62.1330071707.squirrel@mail.dm.unipi.it>
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<49731.151.32.246.62.1330071707.squirrel@mail.dm.unipi.it>
Message-ID: <49732.151.32.246.62.1330071816.squirrel@mail.dm.unipi.it>

Sorry, I sent the previous mail by mistake...

-- 
http://bodrato.it/


From nisse at lysator.liu.se  Fri Feb 24 10:01:23 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 10:01:23 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86pqd6r5zk.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Wed, 22 Feb 2012 21:02:55 +0100")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
Message-ID: <nnhaygsizg.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Surely a plain TMP_ALLOC adds red zones?  If not, that is something we
> ought to fix.

But

  tp = TMP_ALLOC_LIMBS (2*n);
  xp = tp + n;

does not add any between T and X (intended to hold n limbs each). So if
one doesn't use TMP_ALLOC_LIMBS_2, one should instead write

  tp = TMP_ALLOC_LIMBS (n);
  xp = TMP_ALLOC_LIMBS (n);

to get red zones for this common case. Right?

Maybe this is more overhead, in the non-debug case, than using
TMP_ALLOC_LIMBS_2. I'm not sure.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Fri Feb 24 10:11:33 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 10:11:33 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <nnhaygsizg.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 10\:01\:23
	+0100")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
Message-ID: <861upkioje.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:
  
  > Surely a plain TMP_ALLOC adds red zones?  If not, that is something we
  > ought to fix.
  
  But
  
    tp = TMP_ALLOC_LIMBS (2*n);
    xp = tp + n;
  
  does not add any between T and X (intended to hold n limbs each). So if
  one doesn't use TMP_ALLOC_LIMBS_2, one should instead write
  
    tp = TMP_ALLOC_LIMBS (n);
    xp = TMP_ALLOC_LIMBS (n);
  
  to get red zones for this common case. Right?
  
  Maybe this is more overhead, in the non-debug case, than using
  TMP_ALLOC_LIMBS_2. I'm not sure.
  
TMP_ALLOC_LIMBS_2 makes two TMP_ALLOC_LIMBS calls if WANT_TMP_DEBUG,
else one.  So the red zones will be there when we want them.
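
A minimal sketch of that behaviour, for readers without the source at
hand.  Everything here is a stand-in and not copied from gmp-impl.h:
SKETCH_TMP_DEBUG plays the role of WANT_TMP_DEBUG, sketch_alloc_limbs
plays the role of TMP_ALLOC_LIMBS (here just malloc, with no red zones),
and limb_t stands for mp_limb_t.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long limb_t;     /* stand-in for mp_limb_t */

#ifndef SKETCH_TMP_DEBUG
#define SKETCH_TMP_DEBUG 0        /* stand-in for WANT_TMP_DEBUG */
#endif

/* Stand-in for TMP_ALLOC_LIMBS.  A debugging allocator would put a
   red zone around each block it returns. */
static limb_t *sketch_alloc_limbs(size_t n)
{
  limb_t *p = malloc(n * sizeof *p);
  assert(p != NULL);
  return p;
}

/* Debug build: two separate blocks, so each gets its own red zones.
   Normal build: one block, split by pointer arithmetic. */
#define SKETCH_ALLOC_LIMBS_2(xp, xsize, yp, ysize)              \
  do {                                                          \
    if (SKETCH_TMP_DEBUG)                                       \
      {                                                         \
        (xp) = sketch_alloc_limbs(xsize);                       \
        (yp) = sketch_alloc_limbs(ysize);                       \
      }                                                         \
    else                                                        \
      {                                                         \
        (xp) = sketch_alloc_limbs((xsize) + (ysize));           \
        (yp) = (xp) + (xsize);                                  \
      }                                                         \
  } while (0)
```

With SKETCH_TMP_DEBUG set to 0, SKETCH_ALLOC_LIMBS_2 (tp, n, xp, n) gives
xp == tp + n, the same layout as the hand-written tp/xp split quoted
above.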

I think the conclusion is that TMP_ALLOC_LIMBS_2 could save some cycles
by collapsing two malloc calls into one, when allocating large blocks
using TMP_ALLOC (as opposed to TMP_BALLOC or TMP_SALLOC).

My idea is that these cycles saved are unimportant, following GMP's
founding principle of relative overhead: Adding a million cycles to a
billion cycle computation does not matter, but adding 1 cycle to a 10
cycle computation is unforgivable.

-- 
Torbjörn

From nisse at lysator.liu.se  Fri Feb 24 10:27:17 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 10:27:17 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <861upkioje.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Fri, 24 Feb 2012 10:11:33 +0100")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
Message-ID: <nnd394shsa.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> I think the conclusion is that TMP_ALLOC_LIMBS_2 could save some cycles
> by collapsing two malloc calls into one, when allocating large blocks
> using TMP_ALLOC (as opposed to TMP_BALLOC or TMP_SALLOC).

What about the test in

#define TMP_ALLOC(n) \
   (LIKELY ((n) < 65536) ? TMP_SALLOC(n) : TMP_BALLOC(n))

That test will cost a cycle or two for each TMP_ALLOC call (with
non-constant n), regardless of size, won't it?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Fri Feb 24 10:40:00 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 10:40:00 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <nnd394shsa.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 10\:27\:17
	+0100")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
	<nnd394shsa.fsf@stalhein.lysator.liu.se>
Message-ID: <86r4xkh8nj.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  What about the test in
  
  #define TMP_ALLOC(n) \
     (LIKELY ((n) < 65536) ? TMP_SALLOC(n) : TMP_BALLOC(n))
  
  That test will cost a cycle or two for each TMP_ALLOC call (with
  non-constant n), regardless of size, won't it?
  
I think my previous statement "1 cycle" should be amended to "2 cycles".

A correctly predicted compare-and-branch costs 1-2 cycles, with a
throughput of 1 per cycle (on any modern machine).  The allocation code
will run in parallel with the branch (assuming again correct prediction).

I cannot see how TMP_ALLOC_LIMBS_2 could save *anything* for small
allocations, since it basically performs the same operations.  I.e., the
net cost of splitting TMP_ALLOC_LIMBS_2 into two TMP_ALLOC_LIMBS is 0.
But it might be +-1 depending on alignment and all sorts of magic.

-- 
Torbjörn

From tg at gmplib.org  Fri Feb 24 10:55:47 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 10:55:47 +0100
Subject: Division call in mpn_gcd
Message-ID: <86ipiwh7x8.fsf@shell.gmplib.org>

Inspired by Marc{,o}'s cleanup commits, I decided to look for TMP_ALLOC*
calls that should be made into TMP_SALLOC* or TMP_BALLOC*.  Then I
spotted this unrelated thing:

  tp = TMP_ALLOC_LIMBS(talloc);

  if (usize > n)
    {
      mpn_tdiv_qr (tp, up, 0, up, usize, vp, n);

      if (mpn_zero_p (up, n))
	{
	  MPN_COPY (gp, vp, n);
	  ctx.gn = n;
	  goto done;
	}
    }

Why is mpn_tdiv_qr used here, when the quotient should be irrelevant?  I'd
say to use mpn_bdiv_qr instead, to streamline things (followed by a
right shift to get rid of low zeros)?

After all, if g=gcd(a,b) then g | a and g | b, and g | (a + b*c) for any
a,b,c in Z.
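
A tiny single-word check of that property, as a sketch only (gcd_u64 and
gcd_invariant_holds are names made up for this example, and the caller
has to keep a + b*c from overflowing 64 bits):

```c
#include <assert.h>
#include <stdint.h>

/* Plain Euclidean gcd on machine words. */
uint64_t gcd_u64(uint64_t a, uint64_t b)
{
  while (b != 0)
    {
      uint64_t t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Returns 1 if gcd(a + b*c, b) == gcd(a, b).
   Assumes a + b*c fits in 64 bits. */
int gcd_invariant_holds(uint64_t a, uint64_t b, uint64_t c)
{
  return gcd_u64(a + b * c, b) == gcd_u64(a, b);
}
```

For example, gcd(48, 18) = 6 and gcd(48 + 18*5, 18) = gcd(138, 18) = 6.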

-- 
Torbjörn

From bodrato at mail.dm.unipi.it  Fri Feb 24 11:08:37 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 11:08:37 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86r4xkh8nj.fsf@shell.gmplib.org>
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
	<nnd394shsa.fsf@stalhein.lysator.liu.se>
	<86r4xkh8nj.fsf@shell.gmplib.org>
Message-ID: <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>

Ciao,

Il Ven, 24 Febbraio 2012 10:40 am, Torbjorn Granlund ha scritto:
> I cannot see how TMP_ALLOC_LIMBS_2 could save *anything* for small

I'm not sure I agree with Torbjorn. Nevertheless developers' time is a far
more precious resource than a few cpu cycles or bytes of code size...
That's why I am changing my question completely, though it is still about
allocations.


I always used the MPZ_REALLOC macro, to enlarge (if needed) the memory
area available for an integer. This macro gives a (possibly new) pointer
with the requested size available... but it also copies the content.

Sometimes I know in advance that the content can be discarded. Is there a
standard way to ensure a given size without moving data?


Regards,
m

-- 
http://bodrato.it/papers/


From tg at gmplib.org  Fri Feb 24 11:32:37 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 11:32:37 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
	(bodrato@mail.dm.unipi.it's message of "Fri\,
	24 Feb 2012 11\:08\:37 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
	<nnd394shsa.fsf@stalhein.lysator.liu.se>
	<86r4xkh8nj.fsf@shell.gmplib.org>
	<49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
Message-ID: <86ehtkh67u.fsf@shell.gmplib.org>

bodrato at mail.dm.unipi.it writes:

  Ciao,
  
  Il Ven, 24 Febbraio 2012 10:40 am, Torbjorn Granlund ha scritto:
  > I cannot see how TMP_ALLOC_LIMBS_2 could save *anything* for small
  
  I'm not sure I agree with Torbjorn. Nevertheless developers' time is a far
  more precious resource than a few cpu cycles or bytes of code size...
  That's why I am changing my question completely, though it is still about
  allocations.
  
  
  I always used the MPZ_REALLOC macro, to enlarge (if needed) the memory
  area available for an integer. This macro gives a (possibly new) pointer
  with the requested size available... but it also copies the content.
  
  Sometimes I know in advance that the content can be discarded. Is there a
  standard way to ensure a given size without moving data?
  
I use this trick for that:

  rp = realloc (rp, 1);
  rp = realloc (rp, newsize);

I suspect there is a lot to win from using such a trick, at least if the
old size was large.  But perhaps some malloc implementations are so slow
at finding a 1-byte block that it can also become slower?  (It is a
pity the C library provides only very primitive allocation functions.)
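
Spelled out with error handling, the trick might look like the sketch
below.  The function name resize_discard is hypothetical, and this uses
plain realloc rather than GMP's replaceable allocation functions.  The
first call shrinks the block so that the allocator has almost nothing to
copy if the second call has to move it.

```c
#include <assert.h>
#include <stdlib.h>

/* Resize *pp to newsize bytes (newsize > 0) without preserving the old
   contents.  Returns 0 on success, -1 on failure; on failure *pp still
   points to a valid block that the caller must free. */
int resize_discard(void **pp, size_t newsize)
{
  void *q;

  assert(newsize > 0);

  q = realloc(*pp, 1);          /* shrink: little or nothing to copy */
  if (q != NULL)
    *pp = q;                    /* if shrinking failed, *pp is unchanged */

  q = realloc(*pp, newsize);    /* grow: copies at most 1 byte */
  if (q == NULL)
    return -1;

  *pp = q;
  return 0;
}
```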

I am not sure how to deal with this in GMP.  We could add a flag field
in MPZ_REALLOC, or have special functions+macros.

Unfortunately, we are somewhat constrained by the replaceable allocation
functions; changing the __gmp_reallocate_func type will break
compatibility with user code.  We will therefore need to make two
realloc calls by invoking the fairly high-level __gmp_reallocate_func
twice.  Oh well.

-- 
Torbjörn

From bodrato at mail.dm.unipi.it  Fri Feb 24 11:41:10 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Fri, 24 Feb 2012 11:41:10 +0100 (CET)
Subject: _mp_alloc vs ALLOC
In-Reply-To: <86ehtkh67u.fsf@shell.gmplib.org>
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
	<nnd394shsa.fsf@stalhein.lysator.liu.se>
	<86r4xkh8nj.fsf@shell.gmplib.org>
	<49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
	<86ehtkh67u.fsf@shell.gmplib.org>
Message-ID: <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>

Ciao,

Il Ven, 24 Febbraio 2012 11:32 am, Torbjorn Granlund ha scritto:
> bodrato at mail.dm.unipi.it writes:
>   I always used the MPZ_REALLOC macro, to enlarge (if needed) the memory
>   area available for an integer. This macro gives a (possibly new) pointer
>   with the requested size available... but it also copies the content.
>
>   Sometimes I know in advance that the content can be discarded. Is there
>   a standard way to ensure a given size without moving data?
>
> I use this trick for that:
>
>   rp = realloc (rp, 1);
>   rp = realloc (rp, newsize);

Inspired by mpz/mul.c... Maybe we can write a macro based on:

(*__gmp_free_func) (PTR(x), ALLOC (x) * BYTES_PER_MP_LIMB);
ALLOC (x) = newsize;
PTR(x) = (mp_ptr) (*__gmp_allocate_func) (newsize * BYTES_PER_MP_LIMB);

?

Regards,
m

-- 
http://bodrato.it/


From nisse at lysator.liu.se  Fri Feb 24 11:44:39 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 11:44:39 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <86ipiwh7x8.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Fri, 24 Feb 2012 10:55:47 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
Message-ID: <nn8vjsse7c.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

>   if (usize > n)
>     {
>       mpn_tdiv_qr (tp, up, 0, up, usize, vp, n);

> Why is mpn_tdiv_qr used here, when the quotient should be irrelevant?

You're right that the quotient is not wanted, mpn_div_r would make more
sense, but that function doesn't exist.

> I'd say to use mpn_bdiv_qr instead, to streamline things (followed by
> a right shift to get rid of low zeros)?

We don't require that v is odd (maybe it was a mistake to drop that
requirement?). So to use any bdiv functions, we'd first have to deal with
powers of two upfront. And the quotient is still not needed, so I think
one would want to use bdiv_r, aka redc.

So I think the right method is:

1. In case both u and v are even, do needed book-keeping for the power
   of two in the gcd.

2. Drop trailing zeros of v.

3. Reduce the size of u using redc. Unlike the use in powm, there's no
   amortization of the inverse computation, so we may need new
   thresholds.
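
To make the three steps concrete, here is a toy model with a two-limb u
and a one-limb v.  It is a sketch under assumptions, not GMP code: the
names are hypothetical, unsigned __int128 is a GCC/Clang extension, and
the "redc" in step 3 is a single Montgomery-style step, which replaces u
by |u - q*v| / 2^64.  That is allowed because v is odd at that point, so
dividing by a power of two does not change the gcd.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* Inverse of odd v modulo 2^64, by Newton iteration. */
static uint64_t binvert64(uint64_t v)
{
  uint64_t inv = v;               /* v*v == 1 (mod 8): 3 correct bits */
  for (int i = 0; i < 5; i++)
    inv *= 2 - v * inv;           /* each round doubles the correct bits */
  return inv;
}

static uint64_t gcd64(uint64_t a, uint64_t b)
{
  while (b != 0)
    {
      uint64_t t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* gcd of a two-limb u and a one-limb v, both nonzero. */
uint64_t gcd_2by1_sketch(u128 u, uint64_t v)
{
  assert(u != 0 && v != 0);

  /* Step 1: book-keeping for the power of two common to u and v. */
  unsigned k = 0;
  while (((u | v) & 1) == 0)
    {
      u >>= 1;
      v >>= 1;
      k++;
    }

  /* Step 2: drop the remaining trailing zeros of v (u is odd if any). */
  while ((v & 1) == 0)
    v >>= 1;

  /* Step 3: choose q with q*v == u (mod 2^64), so that
     u - q*v = (u1 - h) * 2^64 where h is the high limb of q*v. */
  uint64_t u0 = (uint64_t) u;
  uint64_t u1 = (uint64_t) (u >> 64);
  uint64_t q = u0 * binvert64(v);
  uint64_t h = (uint64_t) (((u128) q * v) >> 64);
  uint64_t r = u1 >= h ? u1 - h : h - u1;     /* |u - q*v| / 2^64 */

  return gcd64(v, r) << k;
}
```

Note that the inverse is computed for a single reduction here, which is
the missing amortization mentioned in point 3.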

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Fri Feb 24 11:47:43 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 11:47:43 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
	(bodrato@mail.dm.unipi.it's message of "Fri\,
	24 Feb 2012 11\:41\:10 +0100 \(CET\)")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
	<nnd394shsa.fsf@stalhein.lysator.liu.se>
	<86r4xkh8nj.fsf@shell.gmplib.org>
	<49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
	<86ehtkh67u.fsf@shell.gmplib.org>
	<50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
Message-ID: <868vjsh5io.fsf@shell.gmplib.org>

bodrato at mail.dm.unipi.it writes:

  Inspired by mpz/mul.c... Maybe we can write a macro based on:
  
  (*__gmp_free_func) (PTR(x), ALLOC (x) * BYTES_PER_MP_LIMB);
  ALLOC (x) = newsize;
  PTR(x) = (mp_ptr) (*__gmp_allocate_func) (newsize * BYTES_PER_MP_LIMB);
  
That's another possibility.

-- 
Torbjörn

From nisse at lysator.liu.se  Fri Feb 24 12:26:20 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 12:26:20 +0100
Subject: _mp_alloc vs ALLOC
In-Reply-To: <50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
	(bodrato@mail.dm.unipi.it's message of "Fri, 24 Feb 2012 11:41:10
	+0100 (CET)")
References: <alpine.DEB.2.02.1202191341100.3545@laptop-mg.saclay.inria.fr>
	<86d396soc2.fsf@shell.gmplib.org>
	<49618.151.32.247.72.1329938932.squirrel@mail.dm.unipi.it>
	<86zkcar7el.fsf@shell.gmplib.org>
	<nnpqd6tzdg.fsf@stalhein.lysator.liu.se>
	<86pqd6r5zk.fsf@shell.gmplib.org>
	<nnhaygsizg.fsf@stalhein.lysator.liu.se>
	<861upkioje.fsf@shell.gmplib.org>
	<nnd394shsa.fsf@stalhein.lysator.liu.se>
	<86r4xkh8nj.fsf@shell.gmplib.org>
	<49998.151.32.246.62.1330078117.squirrel@mail.dm.unipi.it>
	<86ehtkh67u.fsf@shell.gmplib.org>
	<50044.151.32.246.62.1330080070.squirrel@mail.dm.unipi.it>
Message-ID: <nn4nugsc9v.fsf@stalhein.lysator.liu.se>

bodrato at mail.dm.unipi.it writes:

> Inspired by mpz/mul.c... Maybe we can write a macro based on:
>
> (*__gmp_free_func) (PTR(x), ALLOC (x) * BYTES_PER_MP_LIMB);
> ALLOC (x) = newsize;
> PTR(x) = (mp_ptr) (*__gmp_allocate_func) (newsize * BYTES_PER_MP_LIMB);

That's going to be more expensive in the case that the allocator could
grow the block in place. I imagine that's a likely case when the new
size is just slightly larger than the old size.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Fri Feb 24 13:27:36 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 13:27:36 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <nn8vjsse7c.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 11\:44\:39
	+0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
	<nn8vjsse7c.fsf@stalhein.lysator.liu.se>
Message-ID: <864nugh0w7.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  You're right that the quotient is not wanted, mpn_div_r would make more
  sense, but that function doesn't exist.
  
Indeed it doesn't.  Nor does bdiv_r (for independent dividend and
divisor sizes).

  We don't require that v is odd (maybe it was a mistake to drop that
  requirement?). So to use any bdiv functions, we'd first have to deal with
  powers of two upfront. And the quotient is still not needed, so I think
  one would want to use bdiv_r, aka redc.
  
Except that redc does not accept independent sizes.  Therefore we need
to use bdiv_qr.

  So I think the right method is:
  
  1. In case both u and v are even, do needed book-keeping for the power
     of two in the gcd.
  
  2. Drop trailing zeros of v.
  
  3. Reduce the size of u using redc. Unlike the use in powm, there's no
     amortization of the inverse computation, so we may need new
     thresholds.
  
Or perhaps slightly differently:

1. Drop trailing zeros of v, keep the count as vcnt

2. If u is even, drop its trailing zeros (might be lazy about that, only
   dropping <= vcnt zeros to save time in this part and keep u > v
   (approximately u > v, I know this is not the precise mpn_gcd args
   criterion), but I think that's sub-optimal).

3. Now, if u < v swap them (this can only happen if we dropped all u
   zero bits).

4. Your point 3.  Which thresholds are you talking about?

-- 
Torbjörn

From nisse at lysator.liu.se  Fri Feb 24 14:28:09 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 14:28:09 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <864nugh0w7.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Fri, 24 Feb 2012 13:27:36 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
	<nn8vjsse7c.fsf@stalhein.lysator.liu.se>
	<864nugh0w7.fsf@shell.gmplib.org>
Message-ID: <nnzkc8qs2e.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Except that redc does not accept independent sizes.  Therefore we need
> to use bdiv_qr.

I think it should be easy to generalize at least redc_1 and redc_2 to
accept independent sizes. Might require some more work for redc_n, which
would need some kind of block-wise processing.

> 4. Your point 3.  Which thresholds are you talking about?

  REDC_1_TO_REDC_2_THRESHOLD
  REDC_2_TO_REDC_N_THRESHOLD

Using bdiv_qr is surely an improvement, but I think we really ought to
have some division function which doesn't require storage for the
unwanted quotient. Should that function be bdiv_r, or redc, or even
div_r? IIRC, bdiv and redc have slightly different notions on what the
"remainder" is, but I imagine either variant would be fine for the gcd
reduction.

When un >> vn, gcd(u, v) shouldn't need O(un) scratch space (btw, this
calls for an initial reduction also in mpz_gcd).

There are a couple of other divisions in the gcd code where the quotient
is similarly unwanted: The initial division in mpn_gcdext, and the
(unlikely) division in mpn_gcd_subdiv_step. All these divisions have the
property that it doesn't really matter if the remainder is a few bits
too large, so any final adjustment step can be omitted.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Fri Feb 24 14:51:00 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Fri, 24 Feb 2012 14:51:00 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <nnzkc8qs2e.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 14\:28\:09
	+0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
	<nn8vjsse7c.fsf@stalhein.lysator.liu.se>
	<864nugh0w7.fsf@shell.gmplib.org>
	<nnzkc8qs2e.fsf@stalhein.lysator.liu.se>
Message-ID: <86y5rsfigr.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:
  
  > Except that redc does not accept independent sizes.  Therefore we need
  > to use bdiv_qr.
  
  I think it should be easy to generalize at least redc_1 and redc_2 to
  accept independent sizes. Might require some more work for redc_n, which
  would need some kind of block-wise processing.
  
For some reason I have forgotten, we decided to stick to their current
definitions.  I suppose we'd basically just need to modify the outer
loop count, and perhaps handle carry-out from addmul_N differently.

  > 4. Your point 3.  Which thresholds are you talking about?
  
    REDC_1_TO_REDC_2_THRESHOLD
    REDC_2_TO_REDC_N_THRESHOLD
  
I'd say REDC_1_TO_REDC_2_THRESHOLD is more-or-less a plain comparison of
the speed of addmul_1 vs addmul_2, and the inversion costs.  They are
measured for one size operand; using two size operands ought to mean
that sqrt(un*vn) should be compared to REDC_1_TO_REDC_2_THRESHOLD.
(I.e., un*vn should be compared to REDC_1_TO_REDC_2_THRESHOLD^2.)

This reasoning disregards the constant term of the speed of an addmul_1
or addmul_2 invocation.
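
As a rough illustration of the comparison above, here is a minimal sketch
of a two-size algorithm choice.  The enum and function names are
hypothetical and the threshold is passed in as a parameter; this is not
GMP's selection code, and like the reasoning above it ignores the
constant per-call overhead.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum sketch_redc { SKETCH_REDC_1, SKETCH_REDC_2 };

/* Compare un*vn with threshold^2.  This is equivalent to comparing
   sqrt(un*vn) with the threshold, but needs no square root.
   Assumes the two products do not overflow size_t. */
static enum sketch_redc
sketch_choose_redc (size_t un, size_t vn, size_t threshold)
{
  return un * vn < threshold * threshold ? SKETCH_REDC_1 : SKETCH_REDC_2;
}
```

With a threshold of 20, sizes (4, 99) fall below the cutoff and sizes
(4, 100) reach it.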

  Using bdiv_qr is surely an improvement, but I think we really ought to
  have some division function which doesn't require storage for the
  unwanted quotient. Should that function be bdiv_r, or redc, or even
  div_r?
  
I suppose the gcd functions are quite tolerant, at least for an initial
reduction like this one.

  IIRC, bdiv and redc have slightly different notions of what the
  "remainder" is, but I imagine either variant would be fine for the gcd
  reduction.

Perhaps this is the reason for keeping redc separate?

  When un >> vn, gcd(u, v) shouldn't need O(un) scratch space (btw, this
  calls for an initial reduction also in mpz_gcd).
  
Makes sense.

  There are a couple of other divisions in the gcd code where the quotient
  is similarly unwanted: The initial division in mpn_gcdext,

Really?  Doesn't that quotient affect the cofactors?

I don't think we need to finalise redc vs (b)div_r before we modify the
gcd code to use Hensel division; if tdiv_qr is good enough now, bdiv_qr
ought to be good enough to be worthwhile as a separate improvement.  Sure, it
is cleaner to keep the quotient out-of-the-way (and it will very
slightly simplify the scratch space allocations).

-- 
Torbjörn

From nisse at lysator.liu.se  Fri Feb 24 14:55:36 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 14:55:36 +0100
Subject: mpn_mul_2c
Message-ID: <nnvcmwqqsn.fsf@stalhein.lysator.liu.se>

Here's a patch adding a new function mpn_mul_2c. Like mpn_mul_2, but
accepting a single-limb input carry.

I'd like to have it (and also mpn_addmul_2c) for generating diagonal
terms in sqr_basecase, but there may be other uses.

In the x86_64 assembly, I was tempted to move the initial
multiplication earlier, but when I tried I made mpn_mul_2 run a cycle
slower (problem is that n_param is in %rdx which collides with the
multiplication). Instead I had to duplicate the code for selecting the
loop entrypoint, and leave the old mul_2 code path unchanged.

Added support in devel/try.c, but there are no other testcases.
Comments appreciated.

Regards,
/Niels

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_2c.patch
Type: text/x-patch
Size: 8330 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20120224/f09382dd/attachment.bin>
-------------- next part --------------

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


From nisse at lysator.liu.se  Fri Feb 24 15:10:38 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Fri, 24 Feb 2012 15:10:38 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <86y5rsfigr.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Fri, 24 Feb 2012 14:51:00 +0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
	<nn8vjsse7c.fsf@stalhein.lysator.liu.se>
	<864nugh0w7.fsf@shell.gmplib.org>
	<nnzkc8qs2e.fsf@stalhein.lysator.liu.se>
	<86y5rsfigr.fsf@shell.gmplib.org>
Message-ID: <nnr4xkqq3l.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels M?ller) writes:
>
>   IIRC, bdiv and redc have slightly different notions on what the
>   "remainder" is, but I imagine either variant would fine for the gcd
>   reduction.
>
> Perhaps this is the reason for keeping redc separate?

IIRC, bdiv functions return a borrow, meaning that the remainder
corresponding to the computed quotient is negative, while redc returns a
carry which means that the computed remainder is a bit too large.

And then the question was whether a remainder-only function should follow
the redc convention, since that's the most important use, or the bdiv_qr
convention, for consistency.

>   There are a couple of other divisions in the gcd code where the quotient
>   is similarly unwanted: The initial division in mpn_gcdext,
>
> Really?  Doesn't that quotient affect the cofactors?

It affects one of the cofactors: the one which we're not going to
return.

> If tdiv_qr is good enough now, bdiv_qr ought to be good enough to be
> worth as a separate improvement.

I agree, but I can't promise I'll get around to do that soon.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Sat Feb 25 14:00:54 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sat, 25 Feb 2012 14:00:54 +0100
Subject: Division call in mpn_gcd
In-Reply-To: <nnr4xkqq3l.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 15\:10\:38
	+0100")
References: <86ipiwh7x8.fsf@shell.gmplib.org>
	<nn8vjsse7c.fsf@stalhein.lysator.liu.se>
	<864nugh0w7.fsf@shell.gmplib.org>
	<nnzkc8qs2e.fsf@stalhein.lysator.liu.se>
	<86y5rsfigr.fsf@shell.gmplib.org>
	<nnr4xkqq3l.fsf@stalhein.lysator.liu.se>
Message-ID: <86obsnkqyh.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  > Perhaps this is the reason for keeping redc separate?
  
  IIRC, bdiv functions return a borrow, meaning that the remainder
  corresponding to the computed quotient is negative, while redc returns a
  carry which means that the computed remainder is a bit too large.
  
That redc behaviour is just one week old...

  And then the question was whether a remainder-only function should follow
  the redc convention, since that's the most important use, or the bdiv_qr
  convention, for consistency.
  
And we shouldn't sacrifice speed for consistency, at the lowest mpn
level.

  > Really?  Doesn't that quotient affect the cofactors?
  
  It affects one of the cofactors: the one which we're not going to
  return.
  
I see.  I suppose that means the caller that really wants the cofactor
should perform this initial (Hensel) division, for efficiency.

-- 
Torbjörn

From tg at gmplib.org  Sat Feb 25 23:03:35 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sat, 25 Feb 2012 23:03:35 +0100
Subject: mpn_mul_2c
In-Reply-To: <nnvcmwqqsn.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Fri\, 24 Feb 2012 14\:55\:36
	+0100")
References: <nnvcmwqqsn.fsf@stalhein.lysator.liu.se>
Message-ID: <86y5rqfu4o.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Here's a patch adding a new function mpn_mul_2c. Like mpn_mul_2, but
  accepting a single-limb input carry.
  
  I'd like to have it (and also mpn_addmul_2c) for generating diagonal
  terms in sqr_basecase, but there may be other uses.
  
Are you rewriting x86_64 sqr_basecase with calls to mul_2?  If that's
faster than the present code, then I think a version with these mul_2c
inlined will be even better.

Or is this experimental stuff?  In that case, are there reasons to
expect an x86_64 mul_2c to be actually used?  What for?

  In the x86_64 assembly, I was tempted to move the initial
  multiplication earlier, but when I tried I made mpn_mul_2 run a cycle
  slower (problem is that n_param is in %rdx which collides with the
  multiplication). Instead I had to duplicate the code for selecting the
  loop entrypoint, and leave the old mul_2 code path unchanged.
  
That's life.  I too have done that a few times.

  Added support in devel/try.c, but there are no other testcases.
  Comments appreciated.
  
Looks OK, except if the x86_64 asm mul_2c will never be used, I think
that change is somewhat questionable, and could be kept local.

-- 
Torbjörn

From nisse at lysator.liu.se  Sun Feb 26 09:50:37 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Sun, 26 Feb 2012 09:50:37 +0100
Subject: mpn_mul_2c
In-Reply-To: <86y5rqfu4o.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Sat, 25 Feb 2012 23:03:35 +0100")
References: <nnvcmwqqsn.fsf@stalhein.lysator.liu.se>
	<86y5rqfu4o.fsf@shell.gmplib.org>
Message-ID: <nnmx86q8pu.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Are you rewriting x86_64 sqr_basecase with calls to mul_2?  If that's
> faster than the present code, then I think a version with these mul_2c
> inlined will be even better.

You're right it's somewhat experimental.

About sqr_basecase, I have two working versions. The addmul_1c based
version uses this loop, which handles one diagonal term and the
off-diagonal terms:

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
      		       (ul << 1) + c0, p1);
    }

It can't compete with current sqr_basecase on x86_64, since the latter
uses addmul_2. It might be useful on other architectures. But it kind-of
begs for assembler implementation, to reduce the linear work outside of
the addmul_1c call (or maybe it would be good enough with just a special
addmul_1 entrypoint).

Then I have a version based on addmul_2c (which also uses mul_2c for the
first round). It uses this loop for the diagonal terms,

  for (; i < n-2; i += 2)
    {
      umul_ppmm (p1, p0, up[i], up[i+1]);
      add_ssaaaa (p1, rp[2*i+1], p1, rp[2*i+1], 0, p0);
      rp[n + i + 1] = mpn_addmul_2c (rp + 2*i + 2, up + i + 2, n - i - 2,
      			     up + i, p1);
    }

The iteration adds

   u_i * u_{i+1} 
       + B * <u_{i+1}, u_i> * <u_{n-1}, ..., u_{i+2}>

with the larger term computed by addmul_2c. This version is some 10%
slower than current sqr_basecase x86_64. Anyway, I think this
organization is promising and simpler than the current addmul_2 loop in
sqr_basecase.c.

This does *not* shift the offdiagonal terms on the fly; I think that
would be a bit too cumbersome in C, maybe it would make sense in
assembler. So one needs an mpn_sqr_diag_addlsh1 as well. I'm considering
doing that (preferably in assembler) with the following iteration

       +-------+
       |u  *  u|
       +---+---+---+
        2* | r1| r0|
           +---+---+
  +        | h1| h0|
 --+-------+---+---+
   |h1'|h0'| r1| r0|
   +---+---+---+---+

Then one can compute u * u + 2*r1 without carry propagating further. But
unfortunately we do need the high carry h1, since we can get the maximum
value h1 = 1 and h0 = 0. Or do you think it would be better to use an
organization like in the general addlsh1_n, computing as many of the
diagonal products as one can fit in registers, and then doing a carry
propagation add of some 4, 6 limbs or even eight limbs at a time?
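
The overall structure discussed here (off-diagonal products, a doubling,
and the diagonal squares) can be sketched in portable C.  This is only a
reference sketch: it uses 32-bit limbs so that plain 64-bit arithmetic
suffices, the name sketch_sqr is hypothetical, and it does not reproduce
the addmul_2c loop or any actual GMP routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not GMP code.  Squares the n-limb
   number up[0..n-1] (least significant limb first, 32-bit limbs) into
   rp[0..2n-1].  rp must not overlap up. */
static void
sketch_sqr (uint32_t *rp, const uint32_t *up, size_t n)
{
  size_t i, j;
  uint64_t acc;
  uint32_t cy, top;

  for (i = 0; i < 2 * n; i++)
    rp[i] = 0;

  /* Off-diagonal terms: sum of u_i * u_j * B^(i+j) over i < j. */
  for (i = 0; i + 1 < n; i++)
    {
      cy = 0;
      for (j = i + 1; j < n; j++)
        {
          acc = (uint64_t) up[i] * up[j] + rp[i + j] + cy;
          rp[i + j] = (uint32_t) acc;
          cy = (uint32_t) (acc >> 32);
        }
      rp[i + n] = cy;
    }

  /* Double the off-diagonal sum and add the diagonal squares
     u_i^2 * B^(2i), two result limbs per iteration. */
  top = 0;                      /* bit shifted out of the previous pair */
  cy = 0;                       /* carry into the next pair, at most 2 */
  for (i = 0; i < n; i++)
    {
      uint64_t sq = (uint64_t) up[i] * up[i];
      uint32_t lo = rp[2 * i], hi = rp[2 * i + 1];
      uint32_t dlo = (uint32_t) (lo << 1) | top;
      uint32_t dhi = (uint32_t) (hi << 1) | (lo >> 31);
      top = hi >> 31;

      acc = (uint64_t) dlo + (uint32_t) sq + cy;
      rp[2 * i] = (uint32_t) acc;
      acc = (acc >> 32) + dhi + (sq >> 32);
      rp[2 * i + 1] = (uint32_t) acc;
      cy = (uint32_t) (acc >> 32);
    }
}
```

The second loop is where the high carry mentioned above shows up: the
carry out of each two-limb step can exceed one bit, so it is kept in a
full limb here rather than in a flag.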

> Looks OK, except if the x86_64 asm mul_2c will never be used, I think
> that change is somewhat questionable, and could be kept local.

Well, I can keep it local for the time being. I also have a patch with
an addmul_2c entrypoint now.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From bodrato at mail.dm.unipi.it  Sun Feb 26 17:38:57 2012
From: bodrato at mail.dm.unipi.it (bodrato at mail.dm.unipi.it)
Date: Sun, 26 Feb 2012 17:38:57 +0100 (CET)
Subject: Double factorial and primorial
In-Reply-To: <52976.151.32.175.169.1324316867.squirrel@mail.dm.unipi.it>
References: <49296.151.32.244.246.1323285639.squirrel@mail.dm.unipi.it>
	<nnhb18kwt1.fsf@stalhein.lysator.liu.se>
	<86mxb0rxgw.fsf@shell.gmplib.org>
	<49490.151.32.245.79.1324157913.squirrel@mail.dm.unipi.it>
	<nnwr9ufgax.fsf@stalhein.lysator.liu.se>
	<49400.151.32.167.109.1324279755.squirrel@mail.dm.unipi.it>
	<nn4nwwgbym.fsf@stalhein.lysator.liu.se>
	<52976.151.32.175.169.1324316867.squirrel@mail.dm.unipi.it>
Message-ID: <49416.151.32.164.233.1330274337.squirrel@mail.dm.unipi.it>

Ciao

On Mon, 19 December 2011 6:47 pm, bodrato at mail.dm.unipi.it wrote:
> On Mon, 19 December 2011 10:14 am, Niels Möller wrote:
>> Since the established name is "double factorial", I think one should
>> use some reasonable abbreviation of that term. _dblfac_ seems good
>> enough to me, if _double_fac_ is too long.

I changed the name one more time: it is mpz_2fac_ui, now. It's in the repo:
http://gmplib.org:8000/gmp/rev/f2f516affc0c

Regards,
m

-- 
http://bodrato.it/software/combinatorics.html


From tg at gmplib.org  Sun Feb 26 21:56:03 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Sun, 26 Feb 2012 21:56:03 +0100
Subject: More cleaning
Message-ID: <86boolfh5o.fsf@shell.gmplib.org>

The K&R support was removed a year or two ago.  But we still have some
clutter remaining from it.  I suppose __GMP_PROTO is an example.

-- 
Torbjörn

From nisse at lysator.liu.se  Mon Feb 27 07:36:25 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 27 Feb 2012 07:36:25 +0100
Subject: More cleaning
In-Reply-To: <86boolfh5o.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Sun, 26 Feb 2012 21:56:03 +0100")
References: <86boolfh5o.fsf@shell.gmplib.org>
Message-ID: <nnhaycrdee.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> The K&R support was removed a year or two ago.  But we still have some
> clutter remaining from it.  I suppose __GMP_PROTO is an example.

It would be nice to get rid of that. Another old remnant is the
__GMP_TOKEN_PASTE setup.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From nisse at lysator.liu.se  Mon Feb 27 10:06:53 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Mon, 27 Feb 2012 10:06:53 +0100
Subject: Problem with the mp_set_memory_functions interface
In-Reply-To: <nn4nvgpzor.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?=
	message of "Fri, 27 Jan 2012 23:07:16 +0100")
References: <nn4nvgpzor.fsf@stalhein.lysator.liu.se>
Message-ID: <nnd390r6fm.fsf@stalhein.lysator.liu.se>

Any thoughts on this problem with the custom memory allocation
functions?

nisse at lysator.liu.se (Niels Möller) writes:

> I find this part of the interface,
>
> :     The REALLOCATE_FUNCTION parameter OLD_SIZE and the FREE_FUNCTION
> :  parameter SIZE are passed for convenience, but of course they can be
> :  ignored if not needed by an implementation.  The default functions
> :  using `malloc' and friends for instance don't use them.
>    
> extremely awkward, bordering to totally broken.

[...]

> mpz_get_str, with a NULL buffer argument, is specified as allocating
> space *exactly* enough for the digits and the NUL terminator.
>
> :  If STR is `NULL', the result string is allocated using the current
> :  allocation function (*note Custom Allocation::).  The block will be
> :  `strlen(str)+1' bytes, that being exactly enough for the string and
> :  null-terminator.
>
> So if mpz_get_str does the natural thing of allocating
>
>   mpz_sizeinbase (x, base) + is_negative + 1
>
> bytes, it *must* use realloc to shrink that allocation in case it turns
> out that mpz_sizeinbase returned a value which was one off (as it is
> allowed to do). Thats really really ugly, it's an overhead that is
> caused by the memory allocation interface, and which is totally
> unnecessary for almost all users.
>
> And then, looking at GMP's mpz/get_str.c, it apparently fails to
> handle this case correctly!

[...]

> we have a bug which will manifest itself if
>
> 1. An application registers its own allocation function, and the custom
>    free function really depends on the size argument being correct.
>
> 2. The application uses mpz_get_str with NULL buffer, and deallocates it
>    according to the procedure suggested by the mpz_get_str
>    documentation.
>
> 3. The application then calls mpz_get_str with a value and a base, for
>    which mpz_sizeinbase returns a value larger than the actual number of
>    digits.

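The size mismatch described in the quoted text can be illustrated on
single machine words.  The sketch below is an assumption-laden stand-in:
sketch_digits_estimate only mimics the "exact or one too big" property
of a digit-count estimate by scaling the bit length, and it is not how
mpz_sizeinbase is implemented.  All names are hypothetical.

```c
#include <stdint.h>

/* Number of significant bits in x (0 for x == 0). */
static unsigned
sketch_bit_length (uint64_t x)
{
  unsigned n = 0;
  while (x != 0)
    {
      n++;
      x >>= 1;
    }
  return n;
}

/* Cheap decimal digit-count estimate from the bit length, using
   1233/4096 as an approximation of log10(2).  For 64-bit inputs the
   result is either exact or one too large. */
static unsigned
sketch_digits_estimate (uint64_t x)
{
  return sketch_bit_length (x) * 1233u / 4096u + 1;
}

/* Exact decimal digit count, i.e. the strlen of the printed number. */
static unsigned
sketch_digits_exact (uint64_t x)
{
  unsigned n = 1;
  while (x >= 10)
    {
      x /= 10;
      n++;
    }
  return n;
}
```

For x = 8 the estimate is 2 while the printed string has 1 digit, so a
buffer sized from the estimate is one byte larger than strlen(str)+1.
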
I'd really prefer not to fix this bug by adding a mostly useless realloc
call to mpz_get_str. I think I'd suggest for the short term:

1. Send a message to gmp-announce, asking if there's anybody out there
   who relies on correct old-size arguments to the reallocate function
   and the free function. And if so, how painful it would be if that
   feature was removed.

2. Based on the above, we can hopefully deprecate the old-size argument,
   and always pass zero (which is what mini-gmp does, btw).

Longer term, I think it would make sense to replace this interface,
deleting the old-size argument altogether. At the same time, we could
rename mp_set_memory_functions to gmp_set_memory_functions. Maybe one
could also simplify it a bit by using a function pointer for realloc
only (realloc(NULL, size) and realloc(p, 0) could be used as substitutes
for malloc and free)?
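
A minimal sketch of what such a realloc-only interface might look like
follows.  The type and function names are hypothetical and nothing here
is an existing GMP interface.  The sketch handles a zero size explicitly,
on the assumption that the C library's own realloc (p, 0) should not be
relied upon, since its behaviour is not portable.

```c
#include <stdlib.h>

/* Hypothetical single-entry allocator type:
   f (NULL, n) allocates, f (p, n) resizes, f (p, 0) frees. */
typedef void *(*sketch_realloc_fn) (void *p, size_t new_size);

static long sketch_live_blocks = 0;     /* bookkeeping for the example */

static void *
sketch_default_realloc (void *p, size_t new_size)
{
  void *q;

  if (new_size == 0)
    {
      /* Free explicitly instead of calling realloc (p, 0). */
      if (p != NULL)
        {
          free (p);
          sketch_live_blocks--;
        }
      return NULL;
    }

  q = realloc (p, new_size);
  if (q == NULL)
    abort ();
  if (p == NULL)
    sketch_live_blocks++;
  return q;
}

/* malloc and free expressed through the single function pointer. */
static void *
sketch_alloc (sketch_realloc_fn f, size_t n)
{
  return f (NULL, n);
}

static void
sketch_free (sketch_realloc_fn f, void *p)
{
  f (p, 0);
}
```

Note that no old-size argument appears anywhere, which is the point of
the proposal.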

Regards,
/Niels


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Mon Feb 27 16:03:01 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 27 Feb 2012 16:03:01 +0100
Subject: New "fastsse" assembly
Message-ID: <861upgnwt6.fsf@shell.gmplib.org>

I pushed new x86_64 assembly making use of 128-bit instructions working
on xmm registers.  While all x86_64 processors probably support the
instructions used, some have less throughput using these than when using
plain 64-bit instructions.

The idea is to include these just before "x86_64" in the mdep search
path, but I have not done that yet; I want to look for small-operands
regressions first.

The files are:

  * copyi.asm and copyd.asm
  * lshift.asm (written in cooperation with David Harvey)

The challenge when using 128-bit ops is alignment; the limbs are just 64
bits while we work with 128 bit ops, and this means operand alignment is
not necessarily better than 64 bits.

The code cannot write the first or last limb with 128-bit operations
unless these are aligned (the last limb is aligned if either the src is
unaligned and the count is odd, or if the src is aligned and the count
is even).  It is however fine to make an *aligned* read using 128-bit
ops always, even if this sometimes means we read outside of a defined
operand (although valgrind seems to dislike that practice...).
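
The parity rule in the parenthesis can be checked with a small C sketch.
It assumes that "the last limb is aligned" means the address just past
the operand is 16-byte aligned, so that the last limb is the upper half
of an aligned 128-bit pair; that reading, and all function names, are
assumptions.  The pointer is assumed to be at least 8-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

static int
sketch_is_16_aligned (const void *p)
{
  return ((uintptr_t) p & 15) == 0;
}

/* Direct check: is the end of the n-limb operand at p 16-byte aligned? */
static int
sketch_end_aligned (const uint64_t *p, size_t n)
{
  return sketch_is_16_aligned (p + n);
}

/* The same answer from the parity rule: aligned start with even count,
   or unaligned start with odd count. */
static int
sketch_end_aligned_by_rule (const uint64_t *p, size_t n)
{
  int src_aligned = sketch_is_16_aligned (p);
  int count_even = (n & 1) == 0;
  return src_aligned ? count_even : !count_even;
}
```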

Further development needed:

* The lshift code is now not unrolled.  Unroll it 2x or 4x to achieve
  even better performance (note that the code already runs well on 5 of
  10 CPUs).

* Make sure lshift does not cause slowdown for small operands.  If
  needed add basecase code to counter slowdown.

* Consider loopmixing for individual CPUs.

* When lshift is finished, write analogous rshift.

* Write copyi/copyd that runs well also on core2 (see comments in these
  files; basically split loop into two, using movdqa also for reads).

-- 
Torbjörn

From tg at gmplib.org  Mon Feb 27 16:21:20 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Mon, 27 Feb 2012 16:21:20 +0100
Subject: Sandybridge addmul_N challenge
In-Reply-To: <864nuhp9oc.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Thu\, 23 Feb 2012 21\:38\:27 +0100")
References: <868vjxbvxl.fsf@shell.gmplib.org>
	<nnaa4buq5v.fsf@stalhein.lysator.liu.se>
	<86sji2khqo.fsf@shell.gmplib.org>
	<nnfwe2tus8.fsf@stalhein.lysator.liu.se>
	<86fwe2qbyp.fsf@shell.gmplib.org>
	<nn39a2szfh.fsf@stalhein.lysator.liu.se>
	<86d395kgc5.fsf@shell.gmplib.org>
	<nny5rtsi0r.fsf@stalhein.lysator.liu.se>
	<868vjtijoa.fsf@shell.gmplib.org>
	<nnty2hs4k6.fsf@stalhein.lysator.liu.se>
	<864nuhp9oc.fsf@shell.gmplib.org>
Message-ID: <86r4xgmhe7.fsf@shell.gmplib.org>

Torbjorn Granlund <tg at gmplib.org> writes:

  carry-in lo in r14
  carry-in hi in rcx
          mov     0(up), %rax
          mul     v1
          mov     8(rp), %r8
          add     %rax, %r8
          mov     %rdx, %r9
          adc     $0, %r9
          mov     8(up), %rax
          mul     v0
          add     %rax, %r8
          adc     %rdx, %r9
          mov     $0, R32(%rbx)
          adc     $0, R32(%rbx)
          add     %r14, %r8               C 0
          adc     %rcx, %r9               C 1
          adc     $0, R32(%rbx)           C might be removed
          mov     %r8, 8(rp)
  carry-out lo in r9
  carry-out hi in rbx
  
I committed code using that block, see mpn/x86_64/coreisbr/addmul_2.asm.

In the end, the code runs at about 3.2 c/l.  I have not reached 3.0 with
complete code.  I have no understanding of what limits things.

I played with convolution style code, i.e., code that multiplies and
accumulates column-wise.  It runs at 2.5 c/l, not counting the final
summarisation code:

	.text
	.globl	main
main:
	push	%r12
	push	%r13
	push	%r14
	mov	$3300000000/4, %ecx
	.align	16
1:
	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r8
	adc	%rdx, %r9
	adc	$0, %r10d

	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r12
	adc	%rdx, %r13
	adc	$0, %r14d

	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r8
	adc	%rdx, %r9
	adc	$0, %r10d

	mov	8(%rsp), %rax
	mulq	16(%rsp)
	add	%rax, %r12
	adc	%rdx, %r13
	adc	$0, %r10d

	dec	%ecx
	jnz	1b

	pop	%r14
	pop	%r13
	pop	%r12
	ret

-- 
Torbjörn

From tg at gmplib.org  Tue Feb 28 09:07:47 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 09:07:47 +0100
Subject: New failures related to recent developments
Message-ID: <86399vidnw.fsf@shell.gmplib.org>

Three powerpc systems report failure this morning.  I suspect they would
have reported failure already yesterday, if compilation hadn't failed
due to a missing file...

The failures seem to be for small cases, for which the code
(mpz/lucnum2_ui.c) uses a dumbmp/mini-gmp generated table.

I therefore suspect a problem with the new bootstrap.c or mini-gmp.c.

I compared mpn/fib_table.c with a system that did not report any
failures (but this table too was generated with mini-gmp.c).  They have
the same contents.

But fib-table.h has

#define FIB_TABLE_LIMIT         47
#define FIB_TABLE_LUCNUM_LIMIT  47

on one of the failing systems and

#define FIB_TABLE_LIMIT         47
#define FIB_TABLE_LUCNUM_LIMIT  46

on a non-failing system.

(These are from 32-bit builds.  Corresponding differences can be
observed on failing 64-bit builds.)

-- 
Torbjörn

From nisse at lysator.liu.se  Tue Feb 28 09:35:27 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 28 Feb 2012 09:35:27 +0100
Subject: New failures related to recent developments
In-Reply-To: <86399vidnw.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 28 Feb 2012 09:07:47 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
Message-ID: <nnzkc3pd80.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> I compared mpn/fib_table.c with a system that did not report any
> failures (but this table to was generated with mini-gmp.c).  They have
> the same contents.

I compared the output of gen-fib table 32 0, in current gmp and
gmp-5.0.2. Result is identical on my machine.

> But fib-table.h has
>
> #define FIB_TABLE_LIMIT         47
> #define FIB_TABLE_LUCNUM_LIMIT  47
>
> on one of the failing systems and
>
> #define FIB_TABLE_LIMIT         47
> #define FIB_TABLE_LUCNUM_LIMIT  46

I also get 

#define FIB_TABLE_LUCNUM_LIMIT  46

for both current and gmp-5.0.2. So I think we can conclude that's the
correct definition.
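
The two values can be cross-checked independently of gen-fib.  The
sketch below assumes that each limit is the largest index whose
Fibonacci (or Lucas) number still fits in a limb of the given width;
that is my reading of the constants, not code taken from gen-fib.c, and
the function names are hypothetical.

```c
#include <stdint.h>

/* Largest n such that x[n] < 2^bits, for the sequence defined by
   x[0] = x0, x[1] = x1, x[n+1] = x[n] + x[n-1].
   Requires 2 <= bits <= 62 so that nothing overflows 64 bits. */
static int
sketch_seq_limit (uint64_t x0, uint64_t x1, unsigned bits)
{
  uint64_t limit = (uint64_t) 1 << bits;
  int n = 0;                    /* invariant: x0 holds x[n], x1 holds x[n+1] */

  while (x1 < limit)
    {
      uint64_t next = x0 + x1;
      x0 = x1;
      x1 = next;
      n++;
    }
  return n;
}

static int
sketch_fib_limit (unsigned bits)
{
  return sketch_seq_limit (0, 1, bits);         /* F[0] = 0, F[1] = 1 */
}

static int
sketch_lucnum_limit (unsigned bits)
{
  return sketch_seq_limit (2, 1, bits);         /* L[0] = 2, L[1] = 1 */
}
```

For 32-bit limbs this gives 47 and 46, since F[47] = 2971215073 and
L[46] = 4106118243 are below 2^32 while F[48] and L[47] are not.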

I'm not entirely sure I understand how gen-fib is supposed to work. luc_limit
is only assigned like this, in gen-fib.c:generate,

      if (mpz_cmp (l, limit) < 0)
	luc_limit = i-1;
              
Looking at mini-gmp.c:mpz_cmp, I've spotted one bug, but I think
that's unrelated since it affects negative numbers only.

diff -r e21157bb513d mini-gmp/mini-gmp.c
--- a/mini-gmp/mini-gmp.c	Mon Feb 27 14:37:02 2012 +0100
+++ b/mini-gmp/mini-gmp.c	Tue Feb 28 09:26:59 2012 +0100
@@ -1694,7 +1694,7 @@ mpz_cmp (const mpz_t a, const mpz_t b)
   else if (asize > 0)
     return mpn_cmp (a->_mp_d, b->_mp_d, asize);
   else if (asize < 0)
-    return -mpn_cmp (a->_mp_d, b->_mp_d, asize);
+    return -mpn_cmp (a->_mp_d, b->_mp_d, -asize);
   else
     return 0;
 }
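
To show why the sign of the size matters, here is a self-contained
sketch of a sign-magnitude comparison in the same shape as the patched
function.  The struct and helper names are hypothetical stand-ins, not
mini-gmp's types; the limb comparison helper simply does nothing when
given a non-positive count, which is how the unpatched call ends up
reporting two different negative numbers as equal.

```c
#include <stdint.h>

/* Hypothetical stand-in for an mpz: |size| limbs, sign in size. */
typedef struct
{
  int size;
  const uint32_t *d;
} sketch_int;

/* Compare two n-limb magnitudes, most significant limb first. */
static int
sketch_limbs_cmp (const uint32_t *a, const uint32_t *b, int n)
{
  while (--n >= 0)
    if (a[n] != b[n])
      return a[n] > b[n] ? 1 : -1;
  return 0;
}

static int
sketch_cmp (const sketch_int *a, const sketch_int *b)
{
  if (a->size != b->size)
    return a->size < b->size ? -1 : 1;
  else if (a->size > 0)
    return sketch_limbs_cmp (a->d, b->d, a->size);
  else if (a->size < 0)
    /* Both negative: pass the limb count -size, then flip the result. */
    return -sketch_limbs_cmp (a->d, b->d, -a->size);
  else
    return 0;
}
```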

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 28 09:47:52 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 09:47:52 +0100
Subject: New failures related to recent developments
In-Reply-To: <nnzkc3pd80.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Tue\, 28 Feb 2012 09\:35\:27
	+0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
Message-ID: <86r4xffio7.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  I'm not entirely sure I understand how gen-fib is supposed to work. luc_limit
  is only assigned like this, in gen-fib.c:generate,
  
        if (mpz_cmp (l, limit) < 0)
  	luc_limit = i-1;
                
  Looking at mini-gmp.c:mpz_cmp, I've spotted one bug, but I think
  that's unrelated since it affects negative numbers only.
  
  diff -r e21157bb513d mini-gmp/mini-gmp.c
  --- a/mini-gmp/mini-gmp.c	Mon Feb 27 14:37:02 2012 +0100
  +++ b/mini-gmp/mini-gmp.c	Tue Feb 28 09:26:59 2012 +0100
  @@ -1694,7 +1694,7 @@ mpz_cmp (const mpz_t a, const mpz_t b)
     else if (asize > 0)
       return mpn_cmp (a->_mp_d, b->_mp_d, asize);
     else if (asize < 0)
  -    return -mpn_cmp (a->_mp_d, b->_mp_d, asize);
  +    return -mpn_cmp (a->_mp_d, b->_mp_d, -asize);
     else
       return 0;
   }
  
Since this happens on just a few systems, I don't think it is a simple
logical bug.

I would guess it is a dependency on uninitialised data.

The failing systems seem to have no working debugging environment (which
I understand).

-- 
Torbjörn

From nisse at lysator.liu.se  Tue Feb 28 10:14:11 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 28 Feb 2012 10:14:11 +0100
Subject: New failures related to recent developments
In-Reply-To: <86r4xffio7.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 28 Feb 2012 09:47:52 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86r4xffio7.fsf@shell.gmplib.org>
Message-ID: <nnvcmrpbfg.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> The failing systems seem to have no working debugging environment (which
> I understand).

Printing out the value of limit (ought to be two limbs: 0, 1) and inputs
and output of the mpz_cmp call would be helpful, I think. Then we'll see
which operation goes wrong.

/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 28 10:14:22 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 10:14:22 +0100
Subject: New failures related to recent developments
In-Reply-To: <nnzkc3pd80.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Tue\, 28 Feb 2012 09\:35\:27
	+0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
Message-ID: <86mx83fhg1.fsf@shell.gmplib.org>

There is a systematic problem in mini-gmp.c when MPZ_REALLOC is called
when a destination variable is the same as some other (source or
destination) variable.

After MPZ_REALLOC, all cached pointers must be considered to be defunct.

I've spotted this error in 4 functions, but I haven't made a proper code
review.
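
The safe ordering can be sketched on a toy number type: grow the
destination first, and only then fetch the source limb pointer, so that
an aliased source sees the block's new address.  The struct, the grow
helper and the doubling operation are hypothetical illustrations and
are not taken from mini-gmp.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy integer: size limbs in use, alloc limbs allocated. */
typedef struct
{
  size_t alloc;
  size_t size;
  uint32_t *d;
} sketch_num;

/* Ensure room for n limbs; the limb block may move. */
static uint32_t *
sketch_grow (sketch_num *r, size_t n)
{
  if (n > r->alloc)
    {
      uint32_t *p = (uint32_t *) realloc (r->d, n * sizeof *p);
      if (p == NULL)
        abort ();
      r->d = p;
      r->alloc = n;
    }
  return r->d;
}

/* r = 2 * a, where r and a may be the same object. */
static void
sketch_double (sketch_num *r, const sketch_num *a)
{
  size_t an = a->size, i;
  uint32_t cy = 0;
  uint32_t *rp;
  const uint32_t *ap;

  rp = sketch_grow (r, an + 1); /* may move the limbs of a when r == a */
  ap = a->d;                    /* so read the source pointer afterwards */

  for (i = 0; i < an; i++)
    {
      uint32_t t = ap[i];
      rp[i] = (uint32_t) (t << 1) | cy;
      cy = t >> 31;
    }
  rp[an] = cy;
  r->size = an + (cy != 0);
}
```

Fetching ap before the call to sketch_grow would leave it pointing at
the old block whenever r and a alias and the block moves.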

-- 
Torbj?rn

From tg at gmplib.org  Tue Feb 28 10:18:30 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 10:18:30 +0100
Subject: New failures related to recent developments
In-Reply-To: <86mx83fhg1.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue\, 28 Feb 2012 10\:14\:22 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
Message-ID: <86haybfh95.fsf@shell.gmplib.org>

Torbjorn Granlund <tg at gmplib.org> writes:

  There is a systematic problem in mini-gmp.c when MPZ_REALLOC is called
  when a destination variable is the same as some other (source or
  destination) variable.
  
  After MPZ_REALLOC, all cached pointers must be considered to be defunct.
  
  I've spotted this error in 4 functions, but I haven't made a proper code
  review.

This gives an idea for a testing mode allocation trick:

Let the MPZ_REALLOC macro always allocate a new block whether needed or
not, copy the data thereto, write random garbage to the old area, then
free it.  This will make any defunct pointers read data that very likely
will cause an obvious miscomputation.
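
A minimal sketch of such a test-mode reallocator is given below.  It is
a free-standing function with a hypothetical name, not a replacement
for the actual MPZ_REALLOC macro, and it writes a fixed byte pattern
where the text above suggests random garbage.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical debug reallocator: always moves the block, poisons the
   old contents, then frees the old block. */
static void *
sketch_debug_realloc (void *old, size_t old_size, size_t new_size)
{
  /* Always allocate a fresh block, even if the old one is big enough. */
  void *fresh = malloc (new_size != 0 ? new_size : 1);
  if (fresh == NULL)
    abort ();

  if (old != NULL)
    {
      size_t keep = old_size < new_size ? old_size : new_size;
      memcpy (fresh, old, keep);
      memset (old, 0xA5, old_size);     /* poison before freeing */
      free (old);
    }
  return fresh;
}
```

One caveat: an optimizing compiler may drop a store to memory that is
about to be freed, so a real test mode might need a volatile write or
might have to delay the free.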

-- 
Torbjörn

From nisse at lysator.liu.se  Tue Feb 28 10:23:40 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 28 Feb 2012 10:23:40 +0100
Subject: New failures related to recent developments
In-Reply-To: <86mx83fhg1.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 28 Feb 2012 10:14:22 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
Message-ID: <nnr4xfpazn.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> After MPZ_REALLOC, all cached pointers must be considered to be defunct.

It seems I was only paying attention to the destination pointer. I'll
go through all uses of MPZ_REALLOC.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 28 10:28:20 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 10:28:20 +0100
Subject: New failures related to recent developments
In-Reply-To: <nnr4xfpazn.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Tue\, 28 Feb 2012 10\:23\:40
	+0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
Message-ID: <86boojfgsr.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:
  
  > After MPZ_REALLOC, all cached pointers must be considered to be defunct.
  
  It seems I was only paying attention to the destination pointer. I'll
  go through all uses of MPZ_REALLOC.
  
This is a (partial?) patch.  It seems to fix the present problem.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff
Type: application/octet-stream
Size: 1856 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20120228/abc36137/attachment.obj>
-------------- next part --------------

-- 
Torbjörn

From nisse at lysator.liu.se  Tue Feb 28 11:15:06 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 28 Feb 2012 11:15:06 +0100
Subject: New failures related to recent developments
In-Reply-To: <86boojfgsr.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 28 Feb 2012 10:28:20 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
	<86boojfgsr.fsf@shell.gmplib.org>
Message-ID: <nnmx83p8lx.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> This is a (partial?) patch.  It seems to fix the present problem.
> +   rp = MPZ_REALLOC (r, an + 1);
> + 
> +   ap = a->_mp_d;
> +   bp = b->_mp_d;
> + 
>     if (an < bn)
>       MPN_PTR_SWAP (ap, an, bp, bn);
>   
>     cy = mpn_add (rp, ap, an, bp, bn);
>     rp[an] = cy;

I think this fix to mpz_abs_add is almost right, but the realloc must
use a size MAX(an, bn) + 1. Maybe it ought to be reorganized a bit,
eliminating the ap, bp pointers and the swapping. Something like

   rn = GMP_MAX (an, bn);
   rp = MPZ_REALLOC (r, rn + 1);

   if (an < bn)
     cy = mpn_add (rp, b->_mp_d, bn, a->_mp_d, an);
   else
     cy = mpn_add (rp, a->_mp_d, an, b->_mp_d, bn);

   if (cy > 0)
     rp[rn++] = cy;
     
Will you commit these fixes, or do you want me to do that?

I have found the same four direct MPZ_REALLOC problems when reviewing
the code: mpz_abs_add, mpz_and, mpz_ior and mpz_xor. Then I have looked
for functions which use cached pointers over a call to a function using
MPZ_REALLOC. But I didn't find any problems of that type.

There are a couple of additional pointers cached over an MPZ_REALLOC of a
temporary, but that shouldn't be a problem since the temporary never
overlaps anything else.
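
The rule (reallocate the destination first, and only then read the
source pointers) can be sketched on a toy type as follows. This is not
the mini-gmp code: toy_int, toy_realloc and toy_abs_add are
hypothetical names, and the limb loop is a stand-in for mpn_add.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long limb_t;

/* Toy stand-in for an mpz-like object; NOT the mini-gmp definition. */
typedef struct
{
  size_t alloc;  /* limbs allocated */
  size_t size;   /* limbs in use */
  limb_t *d;     /* least significant limb first */
} toy_int;

/* Grow x to hold at least n limbs.  The block may move, so every
   pointer into the old block must be treated as defunct afterwards. */
static limb_t *
toy_realloc (toy_int *x, size_t n)
{
  if (n > x->alloc)
    {
      limb_t *np = realloc (x->d, n * sizeof (limb_t));
      assert (np != NULL);
      x->d = np;
      x->alloc = n;
    }
  return x->d;
}

/* r = a + b for non-negative values.  r may be the same object as a
   and/or b. */
static void
toy_abs_add (toy_int *r, const toy_int *a, const toy_int *b)
{
  size_t an = a->size, bn = b->size;
  size_t rn = an > bn ? an : bn;
  size_t i;
  limb_t cy = 0;
  limb_t *rp;
  const limb_t *ap, *bp;

  rp = toy_realloc (r, rn + 1);
  /* Fetch the source pointers only now: if r is the same object as a
     or b, a pointer read before the realloc could be stale. */
  ap = a->d;
  bp = b->d;

  for (i = 0; i < rn; i++)
    {
      limb_t x = i < an ? ap[i] : 0;
      limb_t y = i < bn ? bp[i] : 0;
      limb_t s = x + cy;
      cy = s < cy;        /* carry out of x + cy */
      s += y;
      cy += s < y;        /* carry out of adding y */
      rp[i] = s;
    }
  if (cy > 0)
    rp[rn++] = cy;
  r->size = rn;
}
```

Calling toy_abs_add (&a, &a, &b) with a shorter than b exercises the
case where the destination both overlaps a source and has to grow.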

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 28 11:21:25 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 11:21:25 +0100
Subject: New failures related to recent developments
In-Reply-To: <nnmx83p8lx.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Tue\, 28 Feb 2012 11\:15\:06
	+0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
	<86boojfgsr.fsf@shell.gmplib.org>
	<nnmx83p8lx.fsf@stalhein.lysator.liu.se>
Message-ID: <86399vfeca.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Will you commit these fixes, or do you want me to do that?
  
Please take care of any fixing.  It's your baby after all.  :-)

  I have found the same four direct MPZ_REALLOC problems when reviewing
  the code: mpz_abs_add, mpz_and, mpz_ior and mpz_xor. Then I have looked
  for functions which use cached pointers over a call to a function using
  MPZ_REALLOC. But I didn't find any problems of that type.
  
These other usages all seem to be safe.  I finished a review too, since
four eyes see more than two.

(I think the mini-gmp test suite should have caught these errors.  I'd
suggest that you enhance it, and make sure these types of errors are
actually detected.)

-- 
Torbjörn

From nisse at lysator.liu.se  Tue Feb 28 15:40:00 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 28 Feb 2012 15:40:00 +0100
Subject: New failures related to recent developments
In-Reply-To: <86399vfeca.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 28 Feb 2012 11:21:25 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
	<86boojfgsr.fsf@shell.gmplib.org>
	<nnmx83p8lx.fsf@stalhein.lysator.liu.se>
	<86399vfeca.fsf@shell.gmplib.org>
Message-ID: <nnipirowcf.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Please take care of any fixing.  It's your baby after all.  :-)

Checked in now. Hope it's sufficient.

> (I think the mini-gmp test suite should have caught these errors.  I'd
> suggest that you enhance it, and make sure these types of errors are
> actually detected.)

I've added some make rules to make it possible to run make
check-mini-gmp from the gmp tree. It should work both when building in
the srcdir and when using a separate build dir. Limitations: It uses
mini-gmp/tests/Makefile which requires GNU make. And built files are not
removed by make clean (I'd need to check the automake manual for where
to put that).

No changes to the actual tests, yet. There's lots of room for
improvements there.

The bug in mpz_abs_add actually triggered failures in mini-gmp's t-gcd
and t-powm. But they didn't fail earlier when I built mini-gmp
separately; I don't understand why.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 28 19:55:19 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 19:55:19 +0100
Subject: New failures related to recent developments
In-Reply-To: <nnipirowcf.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Tue\, 28 Feb 2012 15\:40\:00
	+0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
	<86boojfgsr.fsf@shell.gmplib.org>
	<nnmx83p8lx.fsf@stalhein.lysator.liu.se>
	<86399vfeca.fsf@shell.gmplib.org>
	<nnipirowcf.fsf@stalhein.lysator.liu.se>
Message-ID: <867gz6aiug.fsf@shell.gmplib.org>

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:
  
  > Please take care of any fixing.  It's your baby after all.  :-)
  
  Checked in now. Hope it's sufficient.
  
We'll see tomorrow morning.

Note that the failures are sticky: a passing "make check" does not
supplant a previous failing one, since that could lead us to miss a
failure.  Once the new builds are done, I will check that the previously
failing seeds now produce no error, then clear the failure to make the
reporting square become green (gmplib.org/devel/tm-date.html).

  I've added some make rules to make it possible to run make
  check-mini-gmp from the gmp tree. It should work both when building in
  the srcdir and when using a separate build dir. Limitations: It uses
  mini-gmp/tests/Makefile which requires GNU make. And built files are not
  removed by make clean (I'd need to check the automake manual for where
  to put that).

Please then add both a clean and a distclean target (perhaps working in
the exact same way).
  
  No changes to the actual tests, yet. There's lots of room for
  improvements there.
  
Do you test reuse at all?

  The bug in mpz_abs_add actually triggered failures in mini-gmp's t-gcd
  and t-powm. But they didn't earlier when I built mini-gmp separately, I
  don't understand why.
  
Different optimisation?

-- 
Torbjörn

From nisse at lysator.liu.se  Tue Feb 28 23:12:04 2012
From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
Date: Tue, 28 Feb 2012 23:12:04 +0100
Subject: New failures related to recent developments
In-Reply-To: <867gz6aiug.fsf@shell.gmplib.org> (Torbjorn Granlund's message of
	"Tue, 28 Feb 2012 19:55:19 +0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
	<86boojfgsr.fsf@shell.gmplib.org>
	<nnmx83p8lx.fsf@stalhein.lysator.liu.se>
	<86399vfeca.fsf@shell.gmplib.org>
	<nnipirowcf.fsf@stalhein.lysator.liu.se>
	<867gz6aiug.fsf@shell.gmplib.org>
Message-ID: <nnmx82obez.fsf@stalhein.lysator.liu.se>

Torbjorn Granlund <tg at gmplib.org> writes:

> Please then add both a clean and a distclean target (perhaps working in
> the exact same way).

I'll look into it. I figured it's not urgent, since the files to be
cleaned away are not built by default, only when the check-mini-gmp
target is used explicitly.

>   No changes to the actual tests, yet. There's lots of room for
>   improvements there.
>   
> Do you test reuse at all?

No. It would be good to do that systematically (like the GMP testsuite).
Possibly in combination with the MPZ_REALLOC hack you mentioned.

/nisse

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

From tg at gmplib.org  Tue Feb 28 23:19:38 2012
From: tg at gmplib.org (Torbjorn Granlund)
Date: Tue, 28 Feb 2012 23:19:38 +0100
Subject: New failures related to recent developments
In-Reply-To: <nnmx82obez.fsf@stalhein.lysator.liu.se> ("Niels
	=?iso-8859-1?Q?M=F6ller=22's?= message of "Tue\, 28 Feb 2012 23\:12\:04
	+0100")
References: <86399vidnw.fsf@shell.gmplib.org>
	<nnzkc3pd80.fsf@stalhein.lysator.liu.se>
	<86mx83fhg1.fsf@shell.gmplib.org>
	<nnr4xfpazn.fsf@stalhein.lysator.liu.se>
	<86boojfgsr.fsf@shell.gmplib.org>
	<nnmx83p8lx.fsf@stalhein.lysator.liu.se>
	<86399vfeca.fsf@shell.gmplib.org>
	<nnipirowcf.fsf@stalhein.lysator.liu.se>
	<867gz6aiug.fsf@shell.gmplib.org>
	<nnmx82obez.fsf@stalhein.lysator.liu.se>
Message-ID: <86vcmq8uth.fsf@shell.gmplib.org>

  > Please then add both a clean and a distclean target (perhaps working in
  > the exact same way).
  
  I'll look into it. I figured it's not urgent, since the files to be
  cleaned away are not built by default, only when the check-mini-gmp
  target is used explicitly.
  
I agree this is highly non-urgent.  :-)

I just mentioned the desirable targets to have your TODO file contain
the right entry, avoiding the need to patch the patch.

  No. It would be good to do that systematically (like the GMP testsuite).
  Possibly in combination with the MPZ_REALLOC hack you mentioned.
  
It might be easiest to steal GMP's mpz/reuse.c since its table-driven
approach should make it a very quick job.
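
A rough sketch of what a table-driven reuse test looks like, on a toy
heap-backed integer rather than on mpz_t. It is not a copy of
mpz/reuse.c; toy_int, the toy_* operations, binops and check_reuse are
hypothetical names used only for illustration. Each table entry is run
once with all operands distinct to get a reference result, then again
with the destination overlapping each source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy heap-backed integer standing in for mpz_t. */
typedef struct { unsigned long *d; } toy_int;

static void
toy_init_set (toy_int *x, unsigned long v)
{
  x->d = malloc (sizeof *x->d);
  assert (x->d != NULL);
  *x->d = v;
}

static void toy_clear (toy_int *x) { free (x->d); x->d = NULL; }

static void toy_add (toy_int *r, const toy_int *a, const toy_int *b)
{ *r->d = *a->d + *b->d; }
static void toy_sub (toy_int *r, const toy_int *a, const toy_int *b)
{ *r->d = *a->d - *b->d; }
static void toy_mul (toy_int *r, const toy_int *a, const toy_int *b)
{ *r->d = *a->d * *b->d; }

typedef void (*binop_fn) (toy_int *, const toy_int *, const toy_int *);

/* The table: adding a function to the test is a one-line change. */
static const struct { const char *name; binop_fn fn; } binops[] = {
  { "toy_add", toy_add },
  { "toy_sub", toy_sub },
  { "toy_mul", toy_mul },
};

/* Returns the number of functions whose result changes when the
   destination is the same object as a source. */
static int
check_reuse (unsigned long u, unsigned long v)
{
  int failures = 0;
  size_t i;

  for (i = 0; i < sizeof binops / sizeof binops[0]; i++)
    {
      toy_int a, b, ref;
      int bad = 0;

      toy_init_set (&a, u);
      toy_init_set (&b, v);
      toy_init_set (&ref, 0);

      binops[i].fn (&ref, &a, &b);   /* reference: no overlap */
      binops[i].fn (&a, &a, &b);     /* destination is first source */
      bad |= *a.d != *ref.d;

      *a.d = u;                      /* restore the clobbered input */
      binops[i].fn (&b, &a, &b);     /* destination is second source */
      bad |= *b.d != *ref.d;

      binops[i].fn (&ref, &a, &a);   /* reference for f (x, x) */
      binops[i].fn (&a, &a, &a);     /* all three operands the same */
      bad |= *a.d != *ref.d;

      if (bad)
        fprintf (stderr, "reuse failure in %s\n", binops[i].name);
      failures += bad;

      toy_clear (&a);
      toy_clear (&b);
      toy_clear (&ref);
    }
  return failures;
}
```

Combined with a reallocation that always moves the block, a loop like
this over random operands would be expected to expose stale-pointer
bugs of the mpz_abs_add kind.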

-- 
Torbjörn