GMP 4.3 multiplication performance
tg at gmplib.org
Wed Jun 3 13:35:41 CEST 2009
nisse at lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund <tg at gmplib.org> writes:
> This particular macro is not currently problem free. Let's compare
> lshift+add_n and submul_1 on some machines:
> lshift+sub_n submul_1
> athlon64 3.87 2.5
> core2 3.3 4.5
> pentium4/64 7.3 14.9
> ultrasparc 3 7.75 23
> power4/ppc970 5.0 10
So of these, submul_1 is a decent alternative only on athlon64. I was
just about to ask. Is submul_1 limited by the multiplier or the carry
recurrency? In the latter case, I imagine there's no point in
implementing a separate sub_lshift there.
The decode bandwidth limits athlon64 submul_1 performance. If decode
bandwidth were unlimited, the multiply hardware unit would limit
performance to 2 c/l (it has a repeat rate of 1/2). If the multiply
unit were infinitely powerful, carry recurrency will limit performance
to 1 c/l...
> My solution is that we write mpn_sublsh_n for machines we care about, or
> that we define a USE_SUBMUL_1_FOR_SUB_LSHIFT in tune/tuneup.c.
And then one would put
# define mpn_sub_lshift(...) mpn_submul(...)
# define HAVE_NATIVE_mpn_lshift
in gmp-impl.h, so that code using it need only check for
That seems somewhat unorthodox.
Why override HAVE_NATIVE_mpn_lshift with a new meaning (and what should
defining it to empty mean, that it exists or that it does not exist?)?
To what should mpn_sub_lshift default?
> These might be tricky to get fast on x86, though. One problem is that a
> non-constant general register shift count must reside in the
> rcx/ecx/cx/cl register, which means we need to alternate it between cnt
> and 64-cnt in the loop. (Alternatively use could use MMX/SSE shift
Annoying restriction, that. But since top bits are ignored, two
negation instructions in the loop should be sufficient to switch back
and forth between those values.
Sure possible (My experience is that wasting two registers allows freer
> It should be renamed add_sub_n or perhaps add_n_sub_n, to avoid
> breaching the naming scheme.
Or "butterfly", unless you see a need to have a large family of
functions that compute two things in parallel.
We should allow the GMP assembly madness full freedom. :-)
More information about the gmp-devel