Adrien Prost-Boucle adrien.prost-boucle at
Fri Dec 23 16:22:29 UTC 2016


I have modified my code, following advice from Marco Bodrato (thx!).
I now have the same API for my implementation and for the code I've taken from GMP.
See the archive on my server (somehow my email is rejected if with attachment):

Updated speedup:
  laptop Core2 Duo .. 0% for 32-bit, 3% for 64-bit, 4% for 64x2
  machine i7-4790 ... 3% for 32-bit, 4% for 64-bit, still 20% for 64x2
  Note: my machines are under some load right now, so you may find slightly different values

So the speedup of my implementation is lower that my initial code.
But there is still a bit of speedup, particularly for 64x2 on my higher-end CPU.
(well, that'll hold until only you find another possible optim to my code xD).
However, that won't bring the magic speedup Paul Zimmermann initially wanted in this message:

It seems most of the overhead came from that test:

  // Separately handle when the number is already normalized
  if(val & (((uint64_t)3) << (64 - 2))) {
    return mpn_sqrtrem1_norm64(&rl, val);

That test was present in GMP in function mpn_sqrtrem(), file sqrtrem.c line 447
So I kept it for coherence, to compare against what's actually in GMP.
That test costs a lot because it prevents efficient branch prediction.
But that test was only in function mpn_sqrtrem() and not in "core" functions that call mpn_sqrtrem1() and mpn_sqrtrem2().
So my updated "mpn" code should better look like what's currently in GMP.

Something else I'm not sure of:
In my 64x2 implementation, function sqrt64x2_in(), there are many uses of macros 64x2 for add, sub, mul.
Whereas the GMP algorithm consists of one call to sqrt64 (one-word),
then some operations to get the sqrt64x2 (one div + 2 mul + misc arith and logic).
So for me it's not obvious that my version would bring speedup on ALL platforms, particularly when having hardware division.
Like what I measured on my laptop.
Any idea?

But when not having hardware division, then my 64x2 implementation should be better for sure.
However don't forget that nothing is fully checked nor proven in my implementation! It may still be wrong in some corner cases!


On Tue, 2016-12-20 at 22:36 +0100, Marco Bodrato wrote:
> Ciao,
> Il Lun, 19 Dicembre 2016 6:21 pm, Adrien Prost-Boucle ha scritto:
> > The program enables to launch benchmark with these commands:
> > ./sqrt bench32 10000000
> > ./sqrt bench64 10000000
> > ./sqrt bench64x2 10000000
> It is quite difficult to interpret the numbers, times spent by direct
> calls to some functions is compared to the time spent by other functions
> called trough a wrapper...
> You should decide a single interface and adapt to it all the to be
> compared algorithms.
> Of course, from the point of view of GMP, the best would be to compare
> functions that can be used as a drop-in replacement of the current
> mpn_sqrtrem1 (and mpn_sqrtrem2) functions in GMP mpn/generic/sqrtrem.c
> code.
> Obviously you can choose a different interface, but you should avoid
> wrappers.
> Regards,
> m

More information about the gmp-devel mailing list