Niels Möller nisse at lysator.liu.se
Mon Feb 20 11:59:55 CET 2012

```Torbjorn Granlund <tg at gmplib.org> writes:

> Why would one want to support such large sizes?

It would be nice to get rid of the SQR_TOOM2_THRESHOLD size restriction
in *all* squaring code. Since that restriction makes things a bit
brittle, and causes additional complexities for the fat case. I see good
good reason besides that.

Switching the use of the allocation areas seems like a simple way to get
rid of that. I've been looking mostly at the C sqr_basecase.

>   2. One could do the shifting differently, applying the shift to the limb
>      argument of addmul_1. Something like, when doing the off-diagonal
>      products for up[i],
>
>        mpn_addmul_1 (rp + 2*i+1, up + i + 1, n - i - 1,
>   		   (up[i] << 1) + (up[i-1] >> (GMP_LIMB_BITS - 1)));
>
>      Might be cheaper, if we can get this shifting done in parallel with
>      other operations, and get a simpler carry propagation recurrency when
>      adding diagonal and off-diagonal terms together.
>
> And then handle carry-out from the last shift a a conditional add_n.

It's not that bad. In this scheme, up[i] (shifted) is multiplied by the
n-1-i limbs up[i+1, ..., n-1], i.e., fewer as i increases. The final
off-diagonal iteration (i = n-2) then adds up[n-2] * up[n-1], so if we
shift up[n-2], we only need a conditional add of the single limb
up[n-1].

> I beg to differ about the greatness of this approach.

Care to elaborate on why you expect it to be to slow? I imagine carry
handling for the close-to-diagonal terms up[2k] * up[2k+1] will be slow
without assembler support. Or should we postpone this discussion until
there's some code to compare?

BTW, what do you think about the mpn_addmul_1c entrypoint? Would it make