hgcd1/2

Sun Sep 15 22:55:48 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  > So one could perhaps provide a set of uniformly distribted
  > random numbers for div1 and get adequate results?

  Are you saying that the quotient of two independently uniformly
  distributed numbers has a similar distribution as the euclidean quotient
  sequence? Otherwise, one could tweak the distribution in some way. 

Not exactly.  I think running div1 like hgcd2 but without any of hgcd2
bookkeeping would make some sense.  I.e., feed div1 with the input of
Euclid's algorithm.  To avoid skew from particular operands, perhaps
table 10 uniformly distrinuted random numbers, then do all (a,b) pairs
from that table.

The idea is to make measurements more accurate.

  >   I'm appending another iteration of the patch to add div2 function based
  >   on div1 on the high limbs. Selected via HGCD2_DIV2_METHOD. Benchmarks:
  >
  >   HGCD2_DIV2_METHOD mpn_hgcd2_1   mpn_hgcd2_2   mpn_hgcd2_3
  >                   1    #1504.47       1740.53       1513.56  div1 + unlikely case
  >                   2    #1739.66       1858.01       1753.73  current code
  >                   3    #1615.32       1736.69       1621.95  the if:ed out version
  >
  > Looks good.  Ready for repo?

  I think it's quite ready, but not today. I also wonder if we can delete
  the "current" code right away, to reduce number ov variants. Or if we
  should wire it all up for tuning before deciding that.

The "current" div2, you mean?  The branchy variant is hardly ever going
to be the fastest, so pls go ahead and delete!

  > What is the current reasoning on suppressing the q = 1 code in the hgcd2
  > code calling div1 and div2?

  I don't quite remember, but the operations in the q == 1 case are

    sub_ddmmss      (or just ah -= bh in the single-limb loop)
    if (ah <= bh)   True 41.5%, if I find the right place in Knuth
    u01 += u00;
    u11 += u10;

  So that's pretty cheap, except for the branch each mispredictions can
  cost several cycles.

"Several" is an understatement.  A misprediced branch cost more than 10
cycles today.

  The case of general q obviously needs to compute a quotient, where div1
  gives a candidate quotient that's usually correct. Then the update needs
  two multiplies,

     u01 += q * u00;
     u11 += q * u10;

  In addition, the computation of the remainder is not not so cheap, at
  least not for the double-limb loop, where we get one or two additilan
  multiplies, some carry propagation, and logic for adjusting the div1
  quotient when it happens to be one off.

  My guess is that special q=1 is often beneficial for the double-limb
  loop, and rarely beneficial in the single-limb loop. But needs some
  measurements.

OK.

I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method
2 and 3, 64-bit x86 method 2).  Since these will not be static or
inlined, I call them mpn_div_11.  I am not sure how to time them within
your framework.

I realised one thing for div1: All current variant will pay a great
branch misprediction price except method 1.  Even method 3 pays, as q
>=8 happens in some 20% of the case, so average cost perhaps 20 cycles
times 0.2 = 4 cycles per call.  Or perhaps 8 as two branches must be
passed.

The new method 2 pays as the loop iteration will vary very much and very
unpredictably.  I expect the number of mispredictions will be > 1 per
call, actually.

This is why method 1 stands a chance, in spite of how slow the division
instructions generally are.

I now believe a table-based thing which gets the quotient right > 90% of
the time will be the best approach.  And perhaps hgcd2 could be made to
cope with inaccurate quotients, and that without itself executing
unpredictable branches?  As long as progress is made, perhaps sloppy
quotients can be tolerated?

-- 
Torbjörn
Please encrypt, key id 0xC8601622