broot vs brootinv performancs
Niels Möller
nisse at lysator.liu.se
Thu Nov 1 22:27:59 CET 2012
Torbjorn Granlund <tg at gmplib.org> writes:
> nisse at lysator.liu.se (Niels Möller) writes:
>
> I've tried some more benchmarking (sorry, patches to speed not yet
> checked in). To my disapppointment, broot appear to be 20% slower than
> brootinv (except for small sizes, where it wins because it does the
> initial few iterations without bignum overhead).
>
> I suppose you then kindly disregard brootinv's need for a binvert call?
Yes.
> For rootexact, then surely broot will be faster. Or?
I'm not sure.
> I suppose y and a are the input, i.e a = y and we are computing a^(1/k)
> mod 2^t?
Right, sorry for the confused notation.
> The mullo_n is the same. The sqr + powlo ought to be roughly the same as
> the powlo i brootinv (consider small k so windowing is not useful). I
> had expected a significant win because a mpn_neg of size n/2 ought to be
> faster than the combination of mul_1, sub_n of size n and bdiv_q_1 of
> size n/2.
>
> I think (r^2)^{(k+1)/2} looks strange. Isn't r^{k+1} equivalent?
The point is that we're doubling the precision, input r is n/2 limbs and
output is n limbs. So the powering in both cases is a powlo of n limbs.
While r^2 is a squaring with n/2 input limbs and n output limbs. And a
size n/2 sqr ought to be a bit faster than a size n sqrlo. And it's also
somewhat cleaner to not read the uninitialized high part of r.
> Maybe I ought to power a and r at the same time as
>
> a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2
>
> or so. Some wraparound or mulmid tricks may be applicable to
>
> Why would that be faster?
Because I can skip the computation of a^{k-1}. It will be faster only if
it doesn't add anything expensive in the loop.
> Does mpn_neg take measurable time?
Probably not here.
But in general, I'd expect the ratio of mpn_neg / mpn_pi1_bdiv_q_1 to be
measurably above zero. mpn_neg ought to take at least one or two c/l,
and mpn_pi1_bdiv_q_1 seem to take ten c/l on K10.
I just pushed in a brootinv with all sizes in limbs. And a
single-precision loop for the first limb. It's much faster for small
sizes. For perfect_power_p for small operands I've seen a
modest speedup of 15%, no regression.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list