Torbjorn Granlund tg at gmplib.org
Mon Feb 20 10:40:38 CET 2012

```nisse at lysator.liu.se (Niels Möller) writes:

1. It uses a temporary array, apparently for systems lacking an
mpn_sqr_diag_addlsh1 which can do in-place operation. I think one can
support arbitrary sizes with a small temporary area if one collects
the off-diagonal terms into the result area, and then computes the
diagonal products blockwise, a hundred limbs at a time or so.

Why would one want to support such large sizes?

Current code uses the opposite allocation, with diagonal terms in
the result area and the off-diagonal terms in the temporary array.

Old x86 code (p6, k6, k7) might do it like that.  The 9 other assembly
files for sqr_basecase handle arbitrary operands.

2. One could do the shifting differently, applying the shift to the limb
argument of addmul_1. Something like, when doing the off-diagonal
products for up[i],

mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1,
(up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1)));

Might be cheaper, if we can get this shifting done in parallel with
other operations, and get a simpler carry propagation recurrency when
adding diagonal and off-diagonal terms together.

And then handle carry-out from the last shift a a conditional add_n.

3. The comments on using addmul_2 says that is is tricky. I think that's
because the diagonal terms are still 1x1 products. That would get
simpler of one treats double-limbs as the units everywhere, having
the diagonal computation form the 2x2 products

(u0, u1)^2, (u2, u3)^2, ...

The total number of limb products ought to be the same, if we do each
of these terms as

(u0 + u1 B)^2 = u0^2 + B^2 u1^2 + 2B (u0 * u1)
= u0^2 + B^2 u1^2 + B (u0 << 1) * u1 + B^2 u1 & HIGH_BIT_TO_MASK(u0)

The off-diagonal terms, to be computed with addmul_2, are then

(u0, u1) * (u2, u3, ...)
(u2, u3) * (u4, u5, ...)
...

I guess one can also collect the close-to-diagonal terms B u0 u1 +
B^5 u2 u3 + ..., together with the other off-diagonal terms,
One would then get the off-diagonal sum

B u0*u1 + B^2 (u0, u1) * (u2, u3, ...)
B^5 u2*u3 + B^6 (u2, u3) * (u4, u5, ...)

which begs for an mpn_addmul_1c accepting an additional carry input
limb. Is there such a function?

I beg to differ about the greatness of this approach.  It might make
some parts of the code look simpler, but will it be faster?  The C
sqr_basecase code is a bit hairy, but mainly because of the many
variants, but the many variants are there for best performance
everywhere.

4. There's code to use mpn_addmul_2s. What is that function supposed to
do, is it doing the above sum?

No.  It is a addmul_2 which suppresses a final mul to avoid including
the diagonal product.  I don't think any processor explicitly provides
it (although it is mentioned in several assembly files).  I have
explicit such sparc64 code in a repo someplace.

--
Torbjörn
```