# mpn_mul_2c

Niels Möller nisse at lysator.liu.se
Sun Feb 26 09:50:37 CET 2012

```Torbjorn Granlund <tg at gmplib.org> writes:

> Are you rewriting x86_64 sqr_basecase with calls to mul_2?  If that's
> faster than the present code, then I think a version with these mul_2c
> inlined will be even better.

You're right it's somewhat experimental.

version uses this loop, which handles one diagonal term and the
off-diagonal terms:

for (i = 1; i < n-2; i++)
{
c0 = ul >> (GMP_LIMB_BITS - 1);
ul = up[i];
ulp = ul + c0;
c1 = ulp < c0;
umul_ppmm (p1, p0, ul, ulp);
add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);

rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
(ul << 1) + c0, p1);
}

It can't compete with current sqr_basecase on x86_64, since the latter
uses addmul_2. It might be useful on other architectures. But it kind-of
begs for assembler implementation, to reduce the linear work outside of
the addmul_1c call (or maybe it would be good enough with just a special

Then I have a version based on addmul_2c (which also uses mul_2c for the
first round). It uses this loop for the diagonal terms,

for (; i < n-2; i += 2)
{
umul_ppmm (p1, p0, up[i], up[i+1]);
add_ssaaaa (p1, rp[2*i+1], p1, rp[2*i+1], 0, p0);
rp[n + i + 1] = mpn_addmul_2c (rp + 2*i + 2, up + i + 2, n - i - 2,
up + i, p1);
}

u_i * u_{i+1}
+ B * <u_{i+1}, u_i> * <u_{n-1}, ..., u_{i+2}>

with the larger term computed by addmul_2c. This version is some 10%
slower than current sqr_basecase x86_64. Anyway, I think this
organization is promising and simpler than the current addmul_2 loop in
sqr_basecase.c.

This does *not* shift the offdiagonal terms on the fly; I think that
would be a bit too cumbersome in C, maybe it would make sense in
assembler. So one needs an mpn_sqr_diag_addlsh1 as well. I'm considering
doing that (preferably in assembler) with the following iteration

+-------+
|u  *  u|
+---+---+---+
2* | r1| r0|
+---+---+
+        | h1| h0|
--+-------+---+---+
|h1'|h0'| r1| r0|
+---+---+---+---+

Then one can compute u * u + 2*r1 withut carry propagating further. But
unfortunately we do need the high carry h1, we can get the maximum value
h1 = 1 and h0 = 0. Or do you think it would be better to use an
organization like in the general addlsh1_n, computing as many of the
diagonal products as one can fit in registers, and then doing a carry
propagation add of some 4, 6 limbs or even eight limbs at a time?

> Looks OK, except if the x86_64 asm mul_2c will never be used, I think
> that change is somewhat questionable, and could be kept local.

Well, I can keep it local for the time being. I also have a patch with