[PATCH] mpn/generic/perfsqr: Improve alternate (currently disabled) test.

Mon Mar 2 21:32:47 CET 2026

Ciao David,

14 feb 2026, 19:11 da sparks05 at proton.me:

> If count_trailing_zeros is slow (i.e. not implemented in hardware),
> the mod^2^k test can skip shifting down the input and instead
> shift up the bits to be tested.
>
That's interesting.
We should measure if it's also effective. So, at first, I slightly changed the tune/speed function to measure also the size 1, where the impact is more relevant.

https://gmplib.org/repo/gmp/rev/a15023544812

Anyway, tune/speed measure only the worst case, i.e. when the operand actually is a square.

> +  /* Check that we have even multiplicity of 2, and then check that the rest
> +     is a possible perfect square mod 8.  Leave disabled until we can
> +     determine this really is an improvement.  This supersedes the mod-256
> +     test, throwing out more non-squares (5/6 = 83.3%), but at the expense
> +     of somewhat more cycles.
>

Then I looked at the code. The current mod-256 test detects the 82.8% of non-squares,
in O(1) time. The new one promises a better rate: 83.3%. Not much, and...

> -    lo = up[0];
> -    while (lo == 0)
> -      up++, lo = up[0], usize--;
>

...this test is not even O(1)!
But in the rare circumstances when it isn't O(1), the size of the following tests is reduced.
This sounds as a good idea.

My question: can we really skip limbs?
My answer: if GMP_NUMB_BITS is even, yes we can. But if it's odd...

That's why I tried to write a code working for any GMP_NUMB_BITS value.
I attach a proposed patch. GCC seems able to optimize out all the additional computations
needed only when GMP_NUMB_BITS is odd.

I hardly measure any difference, between the current code (#if 1) and the proposed one (#if 0).
That's why I changed tune/speed again, now mpn_perfect_square_p.2 measures the time
spent by the function on operands with the lowest 2 limbs set to zero.

https://gmplib.org/repo/gmp/rev/48b8f3a3c4ce

Here are the results on my laptop.
Current code (#if 1):
build/tune/speed.1 -s1-4 -p100000000 -c mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
overhead 4.89 cycles, precision 100000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
        mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
1               71.47         71.50        #71.46
2              130.36        128.36       #128.27
3              349.57        343.29       #334.26
4              289.18        289.35       #287.18

Proposed code (#if 0):
build/tune/speed.0 -s1-4 -p100000000 -c mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
overhead 4.89 cycles, precision 100000000 units of 5.28e-10 secs, CPU freq 1895.58 MHz
        mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
1               74.44        #74.25         74.34
2              133.11        #74.83         74.87
3              344.04        134.13        #75.99
4              290.53        355.27       #137.47

Here are the results on shell at gmp.
Current code (#if 1):
[bodrato at shell ~/gmp-repo]$ ./speed.18498_1 -s 1-4 -p1000000000 -c mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
overhead 5.35 cycles, precision 1000000000 units of 2.86e-10 secs, CPU freq 3500.17 MHz
        mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
1              #69.91         69.91         69.92
2              130.71       #126.20        126.20
3              338.19        334.79       #334.32
4              284.02       #279.31        282.25

Proposed code (#if 0):
[bodrato at shell ~/gmp-repo]$ ./speed.18498_0 -s 1-4 -p1000000000 -c mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
overhead 5.35 cycles, precision 1000000000 units of 2.86e-10 secs, CPU freq 3500.17 MHz
        mpn_perfect_square_p mpn_perfect_square_p.1 mpn_perfect_square_p.2
1               70.31        #70.30         70.35
2              132.74        #70.94         71.01
3              340.88        133.17        #72.19
4              278.65        337.05       #132.94

It seems slightly slower in general, even if it can be much faster when skipping zero limbs. But is this case relevant? For the mpn_ layer, we may assume that if this case is frequent, the user code can handle it efficently. on the other side, that's not equally easy for the mpz_ layer.
Should we use that version of the code? Or just change it while keeping it disabled :-) ?

Ĝis,mb
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmp.diff
Type: text/x-patch
Size: 1792 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20260302/a3bae5b3/attachment.bin>