broot vs brootinv performancs

Thu Nov 1 21:07:27 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  I've tried some more benchmarking (sorry, patches to speed not yet
  checked in). To my disapppointment, broot appear to be 20% slower than
  brootinv (except for small sizes, where it wins because it does the
  initial few iterations without bignum overhead).

I suppose you then kindly disregard brootinv's need for a binvert call?

For perfpow's need (as I think you mentioned) the binvert time is
amortised over several brootinv calls.  It would be neater to avoid use
broot instead of binvert+brootinv.

For rootexact, then surely broot will be faster.  Or?

  I don't understand why. When describing the iterations below, let n be
  the target size of the iteraton.

  brootinv uses the iteration

    r' <-- ((k+1) r - r^{k+1} y) / k

  For broot, I've rewritten the iteration as

    r' <-- r - (a^{k-1} (r^2)^{(k+1)/2} - r) / k

  which gave a modest improvement.

I suppose y and a are the input, i.e a = y and we are computing a^(1/k)
mod 2^t?

  The mullo_n is the same. The sqr + powlo ought to be roughly the same as
  the powlo i brootinv (consider small k so windowing is not useful). I
  had expected a significant win because a mpn_neg of size n/2 ought to be
  faster than the combination of mul_1, sub_n of size n and bdiv_q_1 of
  size n/2.

I think (r^2)^{(k+1)/2} looks strange.  Isn't r^{k+1} equivalent?

  There's also a powlo (a^{k-1}) and mullo outside of the loop. I expected
  them to contribute little to the total running time. But mullo seems to
  take 13% for large sizes and mpn_powlo is not currently supported by
  speed, so it's not so easy to measure).

I would suggest that you time things on several machine types.  An AMD
Stars (eg shell) might be the best platform, since all asm code runs
best there.

  Maybe I ought to power a and r at the same time as

    a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2

  or so. Some wraparound or mulmid tricks may be applicable to

Why would that be faster?

  Another note: I also wish for a "bdiv_nq_1" which returns - u / d (mod
  B^n) rather than u / d (mod B^n), then one could eliminate the mpn_neg.
  (And this is somewhat related to my bdiv work a few months ago, which
  also returns a quotient with different sign).

Does mpn_neg take measurable time?

-- 
Torbjörn