mpn_mul_2c
Niels Möller
nisse at lysator.liu.se
Sun Feb 26 09:50:37 CET 2012
Torbjorn Granlund <tg at gmplib.org> writes:
> Are you rewriting x86_64 sqr_basecase with calls to mul_2? If that's
> faster than the present code, then I think a version with these mul_2c
> inlined will be even better.
You're right that it's somewhat experimental.
About sqr_basecase, I have two working versions. The addmul_1c-based
version uses this loop, which per iteration handles one diagonal term
and the corresponding off-diagonal terms:
  for (i = 1; i < n-2; i++)
    {
      c0 = ul >> (GMP_LIMB_BITS - 1);   /* deferred top bit of u_{i-1} */
      ul = up[i];
      ulp = ul + c0;                    /* may wrap around to zero */
      c1 = ulp < c0;
      umul_ppmm (p1, p0, ul, ulp);      /* ul * (ul + c0) */
      add_ssaaaa (p1, rp[2*i], p1, p0, -c1, rp[2*i]);  /* fix wrap, add rp[2*i] */
      rp[n+i] = mpn_addmul_1c (rp + 2*i+1, up + i + 1, n - i - 1,
                               (ul << 1) + c0, p1);
    }
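To spell out the trick as I read it: c0 is the top bit of u_{i-1},
deferred when the previous multiplier 2*u_{i-1} was reduced mod B. This
iteration folds it both into the diagonal product, as ul * (ul + c0),
and into the new multiplier (ul << 1) + c0. The -c1 term patches the one
case where ul + c0 itself wraps (ul = B-1, c0 = 1). Here is a standalone
check of that patched two-limb product (ignoring the rp[2*i] term, which
is added in the same add_ssaaaa), assuming 64-bit limbs and a compiler
with unsigned __int128; this is my illustration, not GMP code:

  #include <stdio.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint64_t cases[] = { 0, 1, 2, 0x8000000000000000u,
                         0xfffffffffffffffeu, 0xffffffffffffffffu };
    int ok = 1;

    for (int c0 = 0; c0 <= 1; c0++)
      for (size_t k = 0; k < sizeof cases / sizeof cases[0]; k++)
        {
          uint64_t ul = cases[k];
          uint64_t ulp = ul + (uint64_t) c0;   /* may wrap to 0 */
          uint64_t c1 = ulp < (uint64_t) c0;   /* 1 iff it wrapped */
          unsigned __int128 prod = (unsigned __int128) ul * ulp;
          uint64_t p1 = (uint64_t) (prod >> 64) - c1;  /* the -c1 patch */
          uint64_t p0 = (uint64_t) prod;
          /* reference: ul^2 + c0*ul, computed in 128-bit arithmetic */
          unsigned __int128 ref = (unsigned __int128) ul * ul
                                  + c0 * (unsigned __int128) ul;
          ok &= (p1 == (uint64_t) (ref >> 64) && p0 == (uint64_t) ref);
        }
    printf (ok ? "identity holds\n" : "mismatch\n");
    return !ok;
  }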
It can't compete with the current sqr_basecase on x86_64, since the
latter uses addmul_2. It might be useful on other architectures. But it
kind of begs for an assembler implementation, to reduce the linear work
outside of the addmul_1c call (or maybe it would be good enough with
just a special addmul_1 entrypoint).
Then I have a version based on addmul_2c (which also uses mul_2c for the
first round). It uses this loop for the off-diagonal terms:
  for (; i < n-2; i += 2)
    {
      umul_ppmm (p1, p0, up[i], up[i+1]);   /* u_i * u_{i+1} */
      add_ssaaaa (p1, rp[2*i+1], p1, rp[2*i+1], 0, p0);  /* rp[2*i+1] += p0 */
      rp[n + i + 1] = mpn_addmul_2c (rp + 2*i + 2, up + i + 2, n - i - 2,
                                     up + i, p1);
    }
The iteration adds, at position 2i+1,

  u_i * u_{i+1} + B * <u_{i+1}, u_i> * <u_{n-1}, ..., u_{i+2}>

where <a, b> denotes the two-limb number a*B + b, with the larger term
computed by addmul_2c. This version is some 10% slower than the current
x86_64 sqr_basecase. Anyway, I think this organization is promising, and
simpler than the current addmul_2 loop in sqr_basecase.c.
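For reference, this is the semantics I have in mind for the addmul_2c
entrypoint, modelled in plain C on top of the public mpn calls. This is
only my reading; in particular, adding the carry-in at position 0 is an
assumption, and the real thing would of course be a single assembler
loop:

  #include <gmp.h>

  /* {rp, n} += {up, n} * <vp[1], vp[0]> + cy; store the low n+1 limbs
     at rp[0..n] and return the limb at position n+1.  */
  static mp_limb_t
  ref_addmul_2c (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n,
                 const mp_limb_t *vp, mp_limb_t cy)
  {
    mp_limb_t hi;

    /* {rp,n} + {up,n}*v0 + cy < B^(n+1), so the combined carry still
       fits in one limb.  */
    hi = mpn_addmul_1 (rp, up, n, vp[0]);
    hi += mpn_add_1 (rp, rp, n, cy);
    rp[n] = hi;

    /* Add up * v1 one position up; return the limb at position n+1.  */
    return mpn_addmul_1 (rp + 1, up, n, vp[1]);
  }

With this convention, the call above stores rp[2*i+2 .. n+i] and the
return value lands at rp[n+i+1], matching the indexing in the loop.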
This does *not* shift the off-diagonal terms on the fly; I think that
would be a bit too cumbersome in C, though maybe it would make sense in
assembler. So one needs an mpn_sqr_diag_addlsh1 as well. I'm considering
doing that (preferably in assembler) with the following iteration:
          +-------+
          | u * u |
          +---+---+---+
      2*      | r1| r0|
              +---+---+
       +      | h1| h0|
      -----------------
      |h1'|h0'| r1| r0|
      +---+---+---+---+
Then one can compute u * u + 2*r1 without any carry propagating further
(both terms fit: u*u + 2*r1 <= B^2 - 1). But unfortunately we do need
the high carry limb h1: it can reach the maximum value h1 = 1, and then
h0 = 0. Or do you think it would be better to use an organization like
in the general addlsh1_n, computing as many of the diagonal products as
one can fit in registers, and then doing a carry-propagating add of some
4, 6 or even 8 limbs at a time?
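For what it's worth, here is how I picture the first variant in plain C.
The in-place interface is my assumption: rp[1..2n-2] holds the unshifted
off-diagonal sum, and rp is rewritten as the full 2n-limb square;
umul_ppmm and add_ssaaaa are the longlong.h macros used above:

  static void
  sqr_diag_addlsh1 (mp_ptr rp, mp_srcptr up, mp_size_t n)
  {
    mp_limb_t p1, p0, h1, h0, r1, r0, s0, s1, x, cy;
    mp_size_t i;

    /* u_0^2: the low limb is final, the high limb starts the carry
       <h1, h0>, sitting at positions (2, 1).  */
    umul_ppmm (h0, rp[0], up[0], up[0]);
    h1 = 0;

    for (i = 1; i < n; i++)
      {
        r1 = rp[2*i];
        r0 = rp[2*i - 1];
        umul_ppmm (p1, p0, up[i], up[i]);

        /* u*u + 2*r1 <= B^2 - 1, so this cannot carry out of p1.  */
        add_ssaaaa (p1, p0, p1, p0,
                    r1 >> (GMP_LIMB_BITS - 1), r1 << 1);

        /* Add 2*r0 and the incoming <h1, h0> at positions (2i, 2i-1). */
        s0 = (r0 << 1) + h0;
        x = (r0 >> (GMP_LIMB_BITS - 1)) + h1 + (s0 < h0);
        s1 = p0 + x;
        cy = s1 < x;
        rp[2*i - 1] = s0;
        rp[2*i] = s1;

        /* New carry; as noted above, h1 = 1 forces h0 = 0.  */
        h0 = p1 + cy;
        h1 = h0 < cy;
      }

    /* The square fits in 2n limbs, so the final h1 is zero.  */
    rp[2*n - 1] = h0;
  }

An assembler version would keep <h1, h0> and the window limbs in
registers, which is where I'd hope to win over the generic addlsh1_n
organization.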
> Looks OK, except if the x86_64 asm mul_2c will never be used, I think
> that change is somewhat questionable, and could be kept local.
Well, I can keep it local for the time being. I also have a patch with
an addmul_2c entrypoint now.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.