[PATCH] mpn/generic/perfsqr: Improve alternate (currently disabled) test.

Tue Mar 3 10:46:41 CET 2026

Thank you for the second look!  It's wonderful to have a second
perspective on an idea.

I mostly just want to document what the mod-256 test is doing.
It's easy to dismiss the lookup table as something opaque, but
what it actually does is very simple.

On Monday, March 2nd, 2026 at 20:32, marco.bodrato at tutanota.com <marco.bodrato at tutanota.com> wrote:
>> If count_trailing_zeros is slow (i.e. not implemented in hardware),
>> the mod^2^k test can skip shifting down the input and instead
>> shift up the bits to be tested.
> 
> That's interesting.
> We should measure if it's also effective. So, at first, I slightly changed the tune/speed function to measure also the size 1, where the impact is more relevant.
> 
> https://gmplib.org/repo/gmp/rev/a15023544812
> 
> Anyway, tune/speed measure only the worst case, i.e. when the operand actually is a square.
> 
>> + /* Check that we have even multiplicity of 2, and then check that the rest
>> + is a possible perfect square mod 8. Leave disabled until we can
>> + determine this really is an improvement. This supersedes the mod-256
>> + test, throwing out more non-squares (5/6 = 83.3%), but at the expense
>> + of somewhat more cycles.

> Then I looked at the code. The current mod-256 test detects the 82.8% of non-squares,
> in O(1) time. The new one promises a better rate: 83.3%. Not much, and...
> 
>> - lo = up[0];
>> - while (lo == 0)
>> - up++, lo = up[0], usize--;
> 
> ...this test is not even O(1)!
> But in the rare circumstances when it isn't O(1), the size of the following tests is reduced.
> This sounds as a good idea.

> My question: can we really skip limbs?
> My answer: if GMP_NUMB_BITS is even, yes we can. But if it's odd...
> 
> That's why I tried to write a code working for any GMP_NUMB_BITS value.
> I attach a proposed patch. GCC seems able to optimize out all the additional computations
> needed only when GMP_NUMB_BITS is odd.

Nice!  Well, honestly it's messy, but it's nice to see it actually done.

But... something's niggling me.  Is it okay to strip off an odd
number of lsbits in general?  Per the Legendre symbol (2/p), this
changes squareness mod p if p == 3 or 5 (mod 8).

To keep the examples small, suppose GMP_NUMB_BITS = 7.
4096 = 0x1000 = 0b0100000_0000000 is obviously square.

But after stripping off a factor of 128, the residue 32
is *not* square, which will give an incorrect answer.

I think we have to work a little harder to fix it. :-(

The obvious (even messier) solution is to unroll the search
by a factor of 2 and strip off pairs of lslimbs.

For the COUNT_TRAILING_ZEROS_SLOW path, it's easy to eliminate
the skip-zero-limbs code and write it so a limb of 0 passes (is
possibly square) the test.  But #ifndef COUNT_TRAILING_ZEROS_0,
detecting zero limbs isn't optional.

Well... maybe.

If count_trailing_zeros(0) is fully nasal-demons Undefined
(in practice, might divide by 0 or something), then we have
a problem.  But if it returns a value, even an unpredictable
one, then it can be masked.

E.g.
    count_trailing_zeros(cnt, lo)
    lo >>= cnt & 62;
    if (lo & 6)
      return 0;

If lo == 0, then the shift is harmless.

Indeed, even if the shift isn't masked, on every processor
I've ever heard of, the shift is reduced modulo *something*
and performed without faulting.

E.g. on ARM and 8086, it's modulo-256.  On 386 and up, it's mod-32,
except for 64-bit values, where it's mod-64.

So (0 >> n) == 0 for all n.

Is this something we're comfortable relying on?

> I hardly measure any difference, between the current code (#if 1) and the proposed one (#if 0).
> That's why I changed tune/speed again, now mpn_perfect_square_p.2 measures the time
> spent by the function on operands with the lowest 2 limbs set to zero.
> 
> https://gmplib.org/repo/gmp/rev/48b8f3a3c4ce

Just FYI, I get "403 Forbidden" for this, so I'm not sure what the
".1" and ".2" option differences actually are.

> It seems slightly slower in general, even if it can be much faster when skipping zero limbs. But is this case relevant? For the mpn_ layer, we may assume that if this case is frequent, the user code can handle it efficiently. on the other side, that's not equally easy for the mpz_ layer.
> Should we use that version of the code? Or just change it while keeping it disabled :-) ?

This is teetering on the edge, but illustrates that lookup tables
aren't always the fastest solution any more.  An L1D cache hit is
4 cycles nowadays, while a 64x64-bit multiply is 3 cycles, and
execution is so wide that you can mostly ignore factors other than
latency.

And that's assuming an L1D cache hit.  A microbenchmark like this
is testing the cache-hot case, which is definitely relevant to
some workloads, but for a helper function which is only used
occasionally, there might be competition for the L1D cache.