GMP 4.3 multiplication performance

Wed Jun 3 13:35:41 CEST 2009

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > This particular macro is not currently problem free.  Let's compare
  > lshift+add_n and submul_1 on some machines:
  >
  >              lshift+sub_n            submul_1
  > athlon64        3.87                     2.5
  > core2           3.3                      4.5
  > pentium4/64     7.3                     14.9
  > ultrasparc 3    7.75                    23
  > power4/ppc970   5.0                     10

  So of these, submul_1 is a decent alternative only on athlon64. I was
  just about to ask. Is submul_1 limited by the multiplier or the carry
  recurrency? In the latter case, I imagine there's no point in
  implementing a separate sub_lshift there.

The decode bandwidth limits athlon64 submul_1 performance.  If decode
bandwidth were unlimited, the multiply hardware unit would limit
performance to 2 c/l (it has a repeat rate of 1/2).  If the multiply
unit were infinitely powerful, carry recurrency will limit performance
to 1 c/l...

  > My solution is that we write mpn_sublsh_n for machines we care about, or
  > that we define a USE_SUBMUL_1_FOR_SUB_LSHIFT in tune/tuneup.c.

  And then one would put

  #if USE_SUBMUL_1_FOR_SUB_LSHIFT
  # define mpn_sub_lshift(...) mpn_submul(...)
  # define HAVE_NATIVE_mpn_lshift
  #endif

  in gmp-impl.h, so that code using it need only check for
  HAVE_NATIVE_mpn_lshift, right?

That seems somewhat unorthodox.

Why override HAVE_NATIVE_mpn_lshift with a new meaning (and what should
defining it to empty mean, that it exists or that it does not exist?)?

To what should mpn_sub_lshift default?

  > These might be tricky to get fast on x86, though.  One problem is that a
  > non-constant general register shift count must reside in the
  > rcx/ecx/cx/cl register, which means we need to alternate it between cnt
  > and 64-cnt in the loop.  (Alternatively use could use MMX/SSE shift
  > operation.)

  Annoying restriction, that. But since top bits are ignored, two
  negation instructions in the loop should be sufficient to switch back
  and forth between those values.

Sure possible (My experience is that wasting two registers allows freer
OoO execution.)

  > It should be renamed add_sub_n or perhaps add_n_sub_n, to avoid
  > breaching the naming scheme.

  Or "butterfly", unless you see a need to have a large family of
  functions that compute two things in parallel.

We should allow the GMP assembly madness full freedom.  :-)

-- 
Torbjörn