A perfect power, and then?

Thu Oct 25 22:48:55 CEST 2012

nisse at lysator.liu.se (Niels Möller) writes:

  I don't fully understand current perfpow.c. But some comments
  nevertheless, for possible improvements:

I used to understand it well, when I supervised Martin a few years back.

  1. We seem to get from inverted root to plain root by binverting the
     input. For the binv_sqrt case, it's seems clearly cheaper to do it
     with a mullo. For binv_root, it's not as obvious since we need a
     different iteration to converge to x^{1/n - 1} rather than to
     x^{-1/n}, but I think both iterations should be about the same cost,
     so that mullo + other iteration is cheaper than binvert + current
     iteration.

OK.  One might also want to consider which is the most useful function.

     Now, the inversion in perfpow appear to be amortized over several
     is_kth_power calls. I'm not sure what that means for performance.
     Since these calls will work with roots of different sizes, a single
     large binvert may still be slower than a number of mullo's of various
     smaller sizes.

I didn't get that.

  2. The iterations take no advantage of cancellation. E.g., in the
     binv_sqrt, we should do the iteration as

       x' <- x - x * (a x^2 - 1)/2

     Here, low part of a x^2 - 1 is always zero, and when we next multiply it
     by x, we don't need to multiply by those zero limbs. I guess we could
     also make some use of mpn_rsh1_sub.

And mulmid?  (mpn_rsh1sub_n is used for the current iteration at line
187.)

  3. In pow_equals, we compute and compare the least significant limbs of
     the product first. But the low xn limbs will *always* match since
     they're computed as a mod B^{xn} root, right? I think a better test
     would be to compute the *most* significant single limb of the power.
     I think that's doable with only double-limb arithmetic (umul_ppmm) a
     relative error on the order of k/B or so. Such a check should give a
     pretty small probability of false positives, so if the high limb is
     close enough, it may be simplest and good enough to always compute
     all the limbs for a final check.

Sounds right.  Such convolution type sum-of-products might want a
separate function, returning 3 limbs.

  4. Use wraparound where possible in the newton iterations.

Yep.  Or mulmid.

  5. In binv_root, I think the number of bits of precision is *exactly*
     doubled in each iteration. At least I'm fairly confident that it does
     for the interation in my code, and I think that's how Hensel lifting
     usually works. The reason why sqrt does not exactly double the number
     of bits is that we divide by the exponent (2), which in this case is
     a factor of the modulo, which makes it a somewhat strange Hensel
     lift.

Your observation is consistent with how the 'order' vectors are
initialised.

Perhaps you could add some of your remarks as a comment to the file?

-- 
Torbjörn