Alternative div_qr_1

Tue Jun 15 17:30:11 CEST 2010

nisse at lysator.liu.se (Niels Möller) writes:

  nisse at lysator.liu.se (Niels Möller) writes:

  > The algorithm is really simple, the iteration is
  >
  >   <r2, r1, r0>  <-- <r0, u[j]> + r1 b2 + r2 b3

  I've now implemented a slightly different variant,

    <r2, r1, r0>  <-- <r0, u[j]> + (r2 b2 - c d) B + r1 b2

  where c is the carry from r0 + r2 b2 (here, r2 is a single bit, and in
  fact <r2, r1>  < B + d - 1). Note that the b3 constant is no longer
  needed.

What is d?  I don't see it in the code.  What is B2mb of the code?

I assume one may write your iteration slightly simpler like this:

    <r2, r1, r0>  <- (r0 + r2 b2 - c d) B + r1 b2 + u[j]

  I'm attaching a new x86_64/mod_1_1.asm. The function mpn_mod_1_1p is
  rewritten from scratch, while mpn_mod_1_1p_cps is unchanged.

I have rewritten the cps function, and pushed it.  I might also have
made a few cosmetic changes to the main function.  Beware of conflicts
if you update!

(I added a SHLD_SLOW m4 parameter, the idea being that we should set
that in gmp-mparam.h for VIA Nano and Intel atom, instead of having more
separate files.  You might want to do the same.)

  Runs at 6 cycles per limb on AMD (panther.gmplib.org). 

Nice!  It runs faster than the old code also on corei and core2.

  On Intel corei (shell.gmplib.org) results are a little odd; it runs at
  11 cycles per limb when the divisor is unnormalized, and 11.5 when the
  divisor is normalized. Which is strange, since there's no instruction in
  the loop which I'd expect to have data-dependent timing.

I can only reproduce *some* data dependency, but not quite like that:

              unnorm         norm
1053         #11.1112       11.1693
1158         #11.1265       11.1803
1273         #11.1508       11.1606
1400          11.2054       11.0793
1540          11.2064       11.1621
1694          11.1866      #11.0681
1863         #11.0945       11.1504
2049          11.1406      #11.1297
2253         #11.0609       11.3940
2478         #11.1096       11.1361
2725         #11.1079       11.2948
2997         #11.1052       11.2135

So for n=2253 we're at about 11.4, for other sizes things run well.
This seems very reproducible.  Since we're reading from just one vector,
alignment should not be an issue, so it ought to just be data related.

I suspect we enter the loop with different pipeline state.  Remember all
P6 processors' register file read port challenges; I suspect that's
where we ought to look for the explanation.

For repentium.gmplib.org, a core2, the effect is more pronounced:

              unnorm         norm
1053         #12.3559       12.5564
1158         #12.3411       12.5681
1273         #12.3414       12.5888
1400         #12.3232       12.6318
1540         #12.3447       12.5185
1694         #12.3239       12.5408
1863         #12.3456       12.5639
2049         #12.3635       12.5597
2253         #12.2976       12.4961
2478         #12.2884       12.5203
2725         #12.2947       12.5466
2997         #12.3461       12.4860

  I'd like to extend this to also compute the quotient, then in each
  iteration one needs to also compute (in the normalized case; I haven't
  thought about the unnormalized case yet):

    <q2, q1, q0>  <--  <r2, r1> <1, v> + c B
                    = r2 <1, v> + r1 v + (r1 + c) B

  where c is the same carry out as in the r update. And then accumulate
  those limbs. Unfortunately, the simple way of doing those additions
  takes an awful lot of instructions. The critical path is still the
  dependency on r1, which will be 7 cycles (one additional move on the
  path, to free up %rax for the r1 v multiply in the q computation). So
  we're going to be limited by decoder bandwidth rather than latency. I
  hope that one should be able to get a loop running in 11 or 10 cycles.

I need to think more about this.

We have a 'classical' div_qr_1_pi2 running at 10.3 c/l since long.  (It
is not yet in the repo.)

-- 
Torbjörn