broot vs brootinv performancs

Tue Nov 13 11:34:25 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  > For rootexact, then surely broot will be faster [than brootinv].  Or?

  I'm not sure.

I haven't had time to spend enough attention on this, but I continue to
be puzzled about the relative speed of these functions.

* The cost of mpn_binvert should be forbidding in certain perfpow
  usages.  (Perhaps that could be smoothed out by computing the needed
  mpn_binvert Hensel lifting on demand, in the loop.)

* The broot loop doesn't look more expensive than brootinv's.

  The point is that we're doubling the precision, input r is n/2 limbs and
  output is n limbs. So the powering in both cases is a powlo of n limbs.
  While r^2 is a squaring with n/2 input limbs and n output limbs. And a
  size n/2 sqr ought to be a bit faster than a size n sqrlo. And it's also
  somewhat cleaner to not read the uninitialized high part of r.

Clever.  Perhaps this could be one in more places?

  >   Maybe I ought to power a and r at the same time as
  >   
  >     a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2
  >   
  >   or so. Some wraparound or mulmid tricks may be applicable to
  >
  > Why would that be faster?

  Because I can skip the computation of a^{k-1}. It will be faster only if
  it doesn't add anything expensive in the loop.

It adds a multiply, but then it removes one as well...

  > Does mpn_neg take measurable time?

  Probably not here.

  But in general, I'd expect the ratio of mpn_neg / mpn_pi1_bdiv_q_1 to be
  measurably above zero. mpn_neg ought to take at least one or two c/l,
  and mpn_pi1_bdiv_q_1 seem to take ten c/l on K10.

I expect neg (could be made to) run at 1 c/l on current machines.  But I
have no idea about how it runs today, since it's not measured.

I cannot recall where the redc/bdiv stuff arrived last time we worked on
it.  We made an effort at unification, making them behave exactly the
same, perhaps eliminating redc.

I also don't recall if a bdiv style quotient can be "negated" at no
extra cost, but I suppose it can.

  I just pushed in a brootinv with all sizes in limbs. And a
  single-precision loop for the first limb. It's much faster for small
  sizes. For perfect_power_p for small operands I've seen a
  modest speedup of 15%, no regression.

Nice!

-- 
Torbjörn