About mpn_sqr_basecase

Mon Feb 20 10:40:38 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  1. It uses a temporary array, apparently for systems lacking an
     mpn_sqr_diag_addlsh1 which can do in-place operation. I think one can
     support arbitrary sizes with a small temporary area if one collects
     the off-diagonal terms into the result area, and then computes the
     diagonal products blockwise, a hundred limbs at a time or so.

Why would one want to support such large sizes?

     Current code uses the opposite allocation, with diagonal terms in
     the result area and the off-diagonal terms in the temporary array.

Old x86 code (p6, k6, k7) might do it like that.  The 9 other assembly
files for sqr_basecase handle arbitrary operands.

  2. One could do the shifting differently, applying the shift to the limb
     argument of addmul_1. Something like, when doing the off-diagonal
     products for up[i],

       mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1,
  		   (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1)));

     Might be cheaper, if we can get this shifting done in parallel with
     other operations, and get a simpler carry propagation recurrency when
     adding diagonal and off-diagonal terms together.

And then handle carry-out from the last shift a a conditional add_n.

  3. The comments on using addmul_2 says that is is tricky. I think that's
     because the diagonal terms are still 1x1 products. That would get
     simpler of one treats double-limbs as the units everywhere, having
     the diagonal computation form the 2x2 products

       (u0, u1)^2, (u2, u3)^2, ...

     The total number of limb products ought to be the same, if we do each
     of these terms as

       (u0 + u1 B)^2 = u0^2 + B^2 u1^2 + 2B (u0 * u1)
       	           = u0^2 + B^2 u1^2 + B (u0 << 1) * u1 + B^2 u1 & HIGH_BIT_TO_MASK(u0)

     The off-diagonal terms, to be computed with addmul_2, are then

       (u0, u1) * (u2, u3, ...)
               (u2, u3) * (u4, u5, ...)
               ...

     I guess one can also collect the close-to-diagonal terms B u0 u1 +
     B^5 u2 u3 + ..., together with the other off-diagonal terms,
     One would then get the off-diagonal sum

       B u0*u1 + B^2 (u0, u1) * (u2, u3, ...)
         B^5 u2*u3 + B^6 (u2, u3) * (u4, u5, ...)

     which begs for an mpn_addmul_1c accepting an additional carry input
     limb. Is there such a function?

I beg to differ about the greatness of this approach.  It might make
some parts of the code look simpler, but will it be faster?  The C
sqr_basecase code is a bit hairy, but mainly because of the many
variants, but the many variants are there for best performance
everywhere.

  4. There's code to use mpn_addmul_2s. What is that function supposed to
     do, is it doing the above sum?

No.  It is a addmul_2 which suppresses a final mul to avoid including
the diagonal product.  I don't think any processor explicitly provides
it (although it is mentioned in several assembly files).  I have
explicit such sparc64 code in a repo someplace.

-- 
Torbjörn