GMP 4.3 multiplication performance

Wed Jun 3 13:19:55 CEST 2009

Torbjorn Granlund <tg at gmplib.org> writes:

> This particular macro is not currently problem free.  Let's compare
> lshift+add_n and submul_1 on some machines:
>
>              lshift+sub_n            submul_1
> athlon64        3.87                     2.5
> core2           3.3                      4.5
> pentium4/64     7.3                     14.9
> ultrasparc 3    7.75                    23
> power4/ppc970   5.0                     10

So of these, submul_1 is a decent alternative only on athlon64. I was
just about to ask. Is submul_1 limited by the multiplier or the carry
recurrency? In the latter case, I imagine there's no point in
implementing a separate sub_lshift there.

> My solution is that we write mpn_sublsh_n for machines we care about, or
> that we define a USE_SUBMUL_1_FOR_SUB_LSHIFT in tune/tuneup.c.

And then one would put

#if USE_SUBMUL_1_FOR_SUB_LSHIFT
# define mpn_sub_lshift(...) mpn_submul(...)
# define HAVE_NATIVE_mpn_lshift
#endif

in gmp-impl.h, so that code using it need only check for
HAVE_NATIVE_mpn_lshift, right?

> These might be tricky to get fast on x86, though.  One problem is that a
> non-constant general register shift count must reside in the
> rcx/ecx/cx/cl register, which means we need to alternate it between cnt
> and 64-cnt in the loop.  (Alternatively use could use MMX/SSE shift
> operation.)

Annoying restriction, that. But since top bits are ignored, two
negation instructions in the loop should be sufficient to switch back
and forth between those values.

> It should be renamed add_sub_n or perhaps add_n_sub_n, to avoid
> breaching the naming scheme.

Or "butterfly", unless you see a need to have a large family of
functions that compute two things in parallel.

/Niels