# broot vs brootinv performancs

Torbjorn Granlund tg at gmplib.org
Tue Nov 13 11:34:25 CET 2012

```nisse at lysator.liu.se (Niels Möller) writes:

> For rootexact, then surely broot will be faster [than brootinv].  Or?

I'm not sure.

I haven't had time to spend enough attention on this, but I continue to
be puzzled about the relative speed of these functions.

* The cost of mpn_binvert should be forbidding in certain perfpow
usages.  (Perhaps that could be smoothed out by computing the needed
mpn_binvert Hensel lifting on demand, in the loop.)

* The broot loop doesn't look more expensive than brootinv's.

The point is that we're doubling the precision, input r is n/2 limbs and
output is n limbs. So the powering in both cases is a powlo of n limbs.
While r^2 is a squaring with n/2 input limbs and n output limbs. And a
size n/2 sqr ought to be a bit faster than a size n sqrlo. And it's also
somewhat cleaner to not read the uninitialized high part of r.

Clever.  Perhaps this could be one in more places?

>   Maybe I ought to power a and r at the same time as
>
>     a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2
>
>   or so. Some wraparound or mulmid tricks may be applicable to
>
> Why would that be faster?

Because I can skip the computation of a^{k-1}. It will be faster only if
it doesn't add anything expensive in the loop.

It adds a multiply, but then it removes one as well...

> Does mpn_neg take measurable time?

Probably not here.

But in general, I'd expect the ratio of mpn_neg / mpn_pi1_bdiv_q_1 to be
measurably above zero. mpn_neg ought to take at least one or two c/l,
and mpn_pi1_bdiv_q_1 seem to take ten c/l on K10.

I expect neg (could be made to) run at 1 c/l on current machines.  But I
have no idea about how it runs today, since it's not measured.

I cannot recall where the redc/bdiv stuff arrived last time we worked on
it.  We made an effort at unification, making them behave exactly the
same, perhaps eliminating redc.

I also don't recall if a bdiv style quotient can be "negated" at no
extra cost, but I suppose it can.

I just pushed in a brootinv with all sizes in limbs. And a
single-precision loop for the first limb. It's much faster for small
sizes. For perfect_power_p for small operands I've seen a
modest speedup of 15%, no regression.

Nice!

--
Torbjörn
```