broot vs brootinv performancs

Tue Nov 13 15:50:29 CET 2012

Torbjorn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   > For rootexact, then surely broot will be faster [than brootinv].  Or?
>   
>   I'm not sure.
>   
> I haven't had time to spend enough attention on this, but I continue to
> be puzzled about the relative speed of these functions.
>
> * The cost of mpn_binvert should be forbidding in certain perfpow
>   usages.  (Perhaps that could be smoothed out by computing the needed
>   mpn_binvert Hensel lifting on demand, in the loop.)

Trying

$ ./speed -o cycles-broken -s 1-30,100,200 -r mpn_mullo_n mpn_binvert

it appears binvert is up to 5 times slower than mullo for just a few
limbs, 2 times slower for n = 10, and roughly 50% slower in the range
between 20 and 200 limbs. So replacing a binvert by a mulllo is not that
big an optimization.

Asymptotically, mullo is M(n) and (b)invert is 2 M(n) or something like
that, right?

> * The broot loop doesn't look more expensive than brootinv's.

No, the loops use about the same operations. But with broot_invm1, we
have an additional large powlo outside of the loop. My best guess is
that is the reason why it is slower.

>   The point is that we're doubling the precision, input r is n/2 limbs and
>   output is n limbs. So the powering in both cases is a powlo of n limbs.
>   While r^2 is a squaring with n/2 input limbs and n output limbs. And a
>   size n/2 sqr ought to be a bit faster than a size n sqrlo. And it's also
>   somewhat cleaner to not read the uninitialized high part of r.
>   
> Clever.  Perhaps this could be one in more places?

I'll see if/when I get to further optimizations of the other functions.

> I cannot recall where the redc/bdiv stuff arrived last time we worked on
> it.  We made an effort at unification, making them behave exactly the
> same, perhaps eliminating redc.

In the repo ~hg/gmp-proj/gmp-bdiv, the bdiv functions are changed to the
new convention with different signs of q, and callers updated. The redc
uses are not touched yet, and bdiv_2 (to replace redc_2) is missing.
IIRC.

> I also don't recall if a bdiv style quotient can be "negated" at no
> extra cost, but I suppose it can.

For bdiv_1, I'm pretty sure there's no extra cost. And since those loops
are small, I think it makes some sense to provide both variants. Maybe
also for bdiv_2, if we do those.

For larger divisors, it's addmul_1 vs submul_1 (and addmul_2 vs missing
submul_2), and since negation is going to be relatively cheap, I *don't*
think we should have functions for both signs of q.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.