mpn_sqrtrem{1,2} - patch for pure C implem

Sun Apr 2 10:52:14 UTC 2017

Hi,

On my side, I do observe a little improvement (64-bit).
With original GMP correction:

1000000x 100 values of 64 bits
Time: 4.77369 s
1000000x 100 values of 48 bits
Time: 5.07621 s
1000000x 100 values of 32 bits
Time: 5.07722 s
1000000x 100 values of 16 bits
Time: 5.24327 s

With my new proposal (let's name that proposal 2):

1000000x 100 values of 64 bits
Time: 4.68028 s
1000000x 100 values of 48 bits
Time: 4.91651 s
1000000x 100 values of 32 bits
Time: 4.99838 s
1000000x 100 values of 16 bits
Time: 5.16515 s

Note that for each bit size, my test first generates 100 random values
and then does sqrt repeatedly, changing value each time.
I think this pseudo-random data should look like real-world usage...
Execution times vary slightly between runs, but the general trend is there.

I also tried a third, and fourth way of writing the correction code.
It uses another variable accu2 to pre-compute the correction.

Proposal 3:
(force sharing intermediate results)
(same speed as my proposal 2)

  /* Correction: add 1 to the root if needed (test: r^2 + 2r + 1 <= a) */
  accu = a0 - root * root;
  accu2 = 2*root + 1;
  if ((mp_limb_signed_t)accu2 <= (mp_limb_signed_t)accu) {
    accu -= accu2;
    root++;
  }
  *rp = accu;

Proposal 4:
(should better fit conditional assignment instructions)
(but appears slightly slower on my laptop)

  accu = a0 - root * root;
  accu2 = accu - (2*root + 1);
  if ((mp_limb_signed_t)accu2 >= 0) {
    accu = accu2;
    root++;
  }
  *rp = accu;

Probably for each implementation there is a combination of compiler+machine for which it's best.
Maybe this part not critical, we just keep the version that is clearest to read?
The real gain will be with machine-specific sqrtrem1 implementation.

I'll benchmark with ABI=32 as you suggest.

Regards,
Adrien

On Sun, 2017-04-02 at 06:44 +0200, Marco Bodrato wrote:
> Ciao,
> 
> Il Sab, 1 Aprile 2017 9:02 pm, Adrien Prost-Boucle ha scritto:
> > On Sat, 2017-04-01 at 18:15 +0200, Marco Bodrato wrote:
> > > After the patch:
> > > $ (cd tests/devel; make sqrtrem_1_2)&&time tests/devel/sqrtrem_1_2 c
> > > Corner cases tested, up to bits:
> > > \ 63
> > > Values of a single limb, tested.
> > > 
> > > > user	1m47.889s
> > I also saved one instruction in the final correction step (for now lightly
> > tested, just make check).
> > +  /* Correction: add 1 to the root if needed (test: r^2 + 2r - a < 0) */
> > +  accu = a0 - root * root;
> > +  if ( (mp_limb_signed_t)(2*root - accu) < 0) {
> > +    accu -= 2*root + 1;
> > +    root++;
> > +  }
> 
> Compared to your previous proposal, that's a small regression, I get:
> > user 	1m49.209s
> 
> I refined a little bit the exhaustive testing program for sqrtrem[12]:
> https://gmplib.org/repo/gmp/rev/c32d616089b4
> 
> For ABI=32, can you please tell us the timings obtained with:
> 
> make&&(cd tests/devel/;make sqrtrem_1_2&&time ./sqrtrem_1_2 x 1)
> 
> before, and after your patch (maybe playing with the different flavours of
> the correction step :-)?
> 
> Best regards,
> m
>