broot vs brootinv performancs
Torbjorn Granlund
tg at gmplib.org
Tue Nov 13 11:34:25 CET 2012
nisse at lysator.liu.se (Niels Möller) writes:
> For rootexact, then surely broot will be faster [than brootinv]. Or?
I'm not sure.
I haven't had time to spend enough attention on this, but I continue to
be puzzled about the relative speed of these functions.
* The cost of mpn_binvert should be forbidding in certain perfpow
usages. (Perhaps that could be smoothed out by computing the needed
mpn_binvert Hensel lifting on demand, in the loop.)
* The broot loop doesn't look more expensive than brootinv's.
The point is that we're doubling the precision, input r is n/2 limbs and
output is n limbs. So the powering in both cases is a powlo of n limbs.
While r^2 is a squaring with n/2 input limbs and n output limbs. And a
size n/2 sqr ought to be a bit faster than a size n sqrlo. And it's also
somewhat cleaner to not read the uninitialized high part of r.
Clever. Perhaps this could be one in more places?
> Maybe I ought to power a and r at the same time as
>
> a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2
>
> or so. Some wraparound or mulmid tricks may be applicable to
>
> Why would that be faster?
Because I can skip the computation of a^{k-1}. It will be faster only if
it doesn't add anything expensive in the loop.
It adds a multiply, but then it removes one as well...
> Does mpn_neg take measurable time?
Probably not here.
But in general, I'd expect the ratio of mpn_neg / mpn_pi1_bdiv_q_1 to be
measurably above zero. mpn_neg ought to take at least one or two c/l,
and mpn_pi1_bdiv_q_1 seem to take ten c/l on K10.
I expect neg (could be made to) run at 1 c/l on current machines. But I
have no idea about how it runs today, since it's not measured.
I cannot recall where the redc/bdiv stuff arrived last time we worked on
it. We made an effort at unification, making them behave exactly the
same, perhaps eliminating redc.
I also don't recall if a bdiv style quotient can be "negated" at no
extra cost, but I suppose it can.
I just pushed in a brootinv with all sizes in limbs. And a
single-precision loop for the first limb. It's much faster for small
sizes. For perfect_power_p for small operands I've seen a
modest speedup of 15%, no regression.
Nice!
--
Torbjörn
More information about the gmp-devel
mailing list