Niels Möller nisse at
Sun Feb 26 09:50:37 CET 2012

Torbjorn Granlund <tg at> writes:

> Are you rewriting x86_64 sqr_basecase with calls to mul_2?  If that's
> faster than the present code, then I think a version with these mul_2c
> inlined will be even better.

You're right, it's somewhat experimental.

About sqr_basecase, I have two working versions. The addmul_1c-based
version uses this loop, which handles one diagonal term and the
off-diagonal terms:

  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);
      ul = up[i];
      ulp = ul + c0;
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
			       (ul << 1) + c0, p1);
    }

It can't compete with the current sqr_basecase on x86_64, since the
latter uses addmul_2. It might be useful on other architectures. But it
kind of begs for an assembler implementation, to reduce the linear work
outside of the addmul_1c call (or maybe it would be good enough with
just a special addmul_1 entrypoint).

Then I have a version based on addmul_2c (which also uses mul_2c for the
first round). It uses this loop for the diagonal terms:

  for (; i < n-2; i += 2)
    {
      umul_ppmm (p1, p0, up[i], up[i+1]);
      add_ssaaaa (p1, rp[2*i+1], p1, rp[2*i+1], 0, p0);
      rp[n + i + 1] = mpn_addmul_2c (rp + 2*i + 2, up + i + 2, n - i - 2,
				     up + i, p1);
    }

The iteration adds

   u_i * u_{i+1} 
       + B * <u_{i+1}, u_i> * <u_{n-1}, ..., u_{i+2}>

with the larger term computed by addmul_2c. This version is some 10%
slower than the current sqr_basecase on x86_64. Anyway, I think this
organization is promising, and simpler than the addmul_2 loop in the
current code.

This does *not* shift the off-diagonal terms on the fly; I think that
would be a bit too cumbersome in C, though maybe it would make sense in
assembler. So one needs an mpn_sqr_diag_addlsh1 as well. I'm considering
doing that (preferably in assembler) with the following iteration:

       |u  *  u|
   + 2*    | r1| r0|
   +       | h1| h0|
   -----------------
   |h1'|h0'| r1| r0|

Then one can compute u * u + 2*(r1,r0) without the carry propagating
further. But unfortunately we do need the high carry h1: the carry can
reach the maximum value h1 = 1, h0 = 0. Or do you think it would be
better to use an organization like in the general addlsh1_n, computing
as many of the diagonal products as one can fit in registers, and then
doing a carry-propagating add of some 4, 6 or even 8 limbs at a time?

> Looks OK, except if the x86_64 asm mul_2c will never be used, I think
> that change is somewhat questionable, and could be kept local.

Well, I can keep it local for the time being. I also have a patch with
an addmul_2c entrypoint now.


Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

More information about the gmp-devel mailing list