[PATCH] Optimize 32-bit sparc T1 multiply routines.
Torbjorn Granlund
tg at gmplib.org
Sun Jan 6 01:25:10 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
> You could always do the two's complement of one of the operands on the
> fly, and then use the same add with carry instructions as in add_n.
>
> I'm thinking aloud, so I'm sorry if I get this wrong, but I think it's
> best to handle the unlikely case of low zero limbs up front. Then it's a
> plain negate of the first non-zero limb, and a plain complement for the
> remaining limbs; the important thing here is that the negation generates
> no additional carries to propagate.
>
> So compared to add_n, you just get an additional xor with -1 in the loop
> (and not on the loop's critical path). I can't guess whether or not that
> will be visible in the execution time.
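
The negation scheme described above can be sketched in C as follows. This is only an illustration, not GMP code: the type name limb_t and the function name neg_n are hypothetical, limbs are assumed to be 64 bits wide and stored least significant first.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical limb type, assumed 64-bit */

/* Sketch: rp[] = -bp[] modulo 2^(64*n), least significant limb first.
   No carry is propagated between limbs. */
static void neg_n(limb_t *rp, const limb_t *bp, size_t n)
{
    size_t i = 0;

    /* Low zero limbs negate to zero. */
    while (i < n && bp[i] == 0)
        rp[i++] = 0;

    /* First non-zero limb: plain negate (unsigned wrap-around). */
    if (i < n) {
        rp[i] = (limb_t)0 - bp[i];
        i++;
    }

    /* Remaining limbs: plain complement, i.e. xor with -1. */
    for (; i < n; i++)
        rp[i] = ~bp[i];
}
```

For example, negating the three-limb value {0, 8, 1} gives {0, 2^64 - 8, 2^64 - 2}.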

For sub_n, I suppose

	ldx
	ldx
	xnor (with %g0)
	addxcc
	stx

would be the right mix. This should run in 2.5 + epsilon c/l if
properly software pipelined. 4x unrolling should give 2.75 c/l, unless
they stick in some pipeline bubbles for taken branches.

For add_n, things should run 0.5 c/l faster.

(I am assuming it is a 2-way pipeline.)

--
Torbjörn
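
The sub_n instruction mix above relies on the identity a - b = a + ~b + 1. A minimal C sketch of that idea follows. It is an illustration only, not the actual assembly loop: limb_t and sub_n_via_complement are hypothetical names, limbs are assumed to be 64 bits wide, and the carry is tracked by hand because C has no add-with-carry.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical limb type, assumed 64-bit */

/* Sketch: rp[] = ap[] - bp[] computed as ap[] + ~bp[] + 1.
   Returns the borrow out (0 or 1). */
static limb_t sub_n_via_complement(limb_t *rp, const limb_t *ap,
                                   const limb_t *bp, size_t n)
{
    limb_t carry = 1;  /* the "+1" of the two's complement */

    for (size_t i = 0; i < n; i++) {
        limb_t nb = ~bp[i];       /* complement step (the xnor with %g0) */
        limb_t s = ap[i] + nb;    /* add step, first half */
        limb_t c1 = s < nb;       /* carry out of ap[i] + nb */
        limb_t r = s + carry;     /* add the incoming carry */
        limb_t c2 = r < s;        /* carry out of adding the carry */
        rp[i] = r;
        carry = c1 | c2;          /* at most one of c1, c2 is set */
    }

    /* A final carry of 1 means no borrow, so invert it. */
    return carry ^ 1;
}
```

One reading of the cycle estimates, assuming two instructions issue per cycle: five instructions per limb give 5/2 = 2.5 c/l, and with 4x unrolling plus two loop-overhead instructions (an assumption), (4*5 + 2)/2/4 = 2.75 c/l. Dropping the complement leaves four instructions per limb for add_n, hence 0.5 c/l less.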