# broot vs brootinv performancs

Niels Möller nisse at lysator.liu.se
Thu Nov 1 22:27:59 CET 2012

```Torbjorn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   I've tried some more benchmarking (sorry, patches to speed not yet
>   checked in). To my disapppointment, broot appear to be 20% slower than
>   brootinv (except for small sizes, where it wins because it does the
>   initial few iterations without bignum overhead).
>
> I suppose you then kindly disregard brootinv's need for a binvert call?

Yes.

> For rootexact, then surely broot will be faster.  Or?

I'm not sure.

> I suppose y and a are the input, i.e a = y and we are computing a^(1/k)
> mod 2^t?

Right, sorry for the confused notation.

>   The mullo_n is the same. The sqr + powlo ought to be roughly the same as
>   the powlo i brootinv (consider small k so windowing is not useful). I
>   had expected a significant win because a mpn_neg of size n/2 ought to be
>   faster than the combination of mul_1, sub_n of size n and bdiv_q_1 of
>   size n/2.
>
> I think (r^2)^{(k+1)/2} looks strange.  Isn't r^{k+1} equivalent?

The point is that we're doubling the precision, input r is n/2 limbs and
output is n limbs. So the powering in both cases is a powlo of n limbs.
While r^2 is a squaring with n/2 input limbs and n output limbs. And a
size n/2 sqr ought to be a bit faster than a size n sqrlo. And it's also
somewhat cleaner to not read the uninitialized high part of r.

>   Maybe I ought to power a and r at the same time as
>
>     a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2
>
>   or so. Some wraparound or mulmid tricks may be applicable to
>
> Why would that be faster?

Because I can skip the computation of a^{k-1}. It will be faster only if
it doesn't add anything expensive in the loop.

> Does mpn_neg take measurable time?

Probably not here.

But in general, I'd expect the ratio of mpn_neg / mpn_pi1_bdiv_q_1 to be
measurably above zero. mpn_neg ought to take at least one or two c/l,
and mpn_pi1_bdiv_q_1 seem to take ten c/l on K10.

I just pushed in a brootinv with all sizes in limbs. And a
single-precision loop for the first limb. It's much faster for small
sizes. For perfect_power_p for small operands I've seen a
modest speedup of 15%, no regression.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
```