GMP 4.3 multiplication performance

Wed Jun 3 12:09:21 CEST 2009

bodrato at mail.dm.unipi.it writes:

  > it unconditionally, instead of lshift.  Optionally, use sublsh_n if
  > HAVE_NATIVE_mpn_sublsh_n and submul_1 otherwise.

  #if HAVE_NATIVE_mpn_sublsh_n
  #define DO_mpn_sublsh_n(dst,src,n,s) mpn_sublsh_n(dst,src,n,s)
  #else
  #define DO_mpn_sublsh_n(dst,src,n,s) mpn_addmul_1(dst,src,n,CNST_LIMB(1) <<s)
  #endif

  Correct?

I'dg sugest to use simply sublsh_n instead of DO_mpn_sublsh_n.

  If it is, can something like this be included in gmp-impl.h, so that we
  can start using sublsh without the need to care about native or non-native
  sublsh in generic code?

I hesitate to add even more macros to gmp-impl.h, it is easy enough to
define things like this at the beginning of files that need them.  Also,
abstracting away details is not always good, since it might hide the
details we need for writing really efficient code.

This particular macro is not currently problem free.  Let's compare
lshift+add_n and submul_1 on some machines:

             lshift+sub_n            submul_1
athlon64        3.87                     2.5
core2           3.3                      4.5
pentium4/64     7.3                     14.9
ultrasparc 3    7.75                    23
power4/ppc970   5.0                     10

My solution is that we write mpn_sublsh_n for machines we care about, or
that we define a USE_SUBMUL_1_FOR_SUB_LSHIFT in tune/tuneup.c.

  > 	addlsh1	(a + 2b)	addlsh	(a + (2^c)b)	2/1.5
  > 	sublsh1	(a - 2b)	sublsh	(a - (2^c)b)	2.21/1.7

  Both are really desired for Toom evaluation and interpolation.

These might be tricky to get fast on x86, though.  One problem is that a
non-constant general register shift count must reside in the
rcx/ecx/cx/cl register, which means we need to alternate it between cnt
and 64-cnt in the loop.  (Alternatively use could use MMX/SSE shift
operation.)

  > 	rsbrsh1	(a/2 - b)	rsbrsh	(a/(2^c) - b)	-/1.5

  This too, may have interesting applications, particularly for a little
  program I'm (slowly) working on... I adopted a different strategy because
  of the lack of such a function.

I believe rsbrsh1 would be simple to get fast even on x86.

  > 	rsh1add	(a + b)/2	rshadd	(a + b)/(2^c)	2.14/1.5
  > 	rsh1sub	(a - b)/2	rshsub	(a - b)/(2^c)	2.14/1.5

  rsh1(add|sub) are already used in both interpolation_5pts and _7pts!

Indeed.  And they are implemented already in GMP 4.3 for several
machines.

  > 	addadd  (a + b + c)				-/2
  > 	addsub  (a + b - c)				-/2

  ...addsub already exists with another meaning: it computes
  (a,b)<-(a+b,a-b), should I assume it will be removed?

It should be renamed add_sub_n or perhaps add_n_sub_n, to avoid
breaching the naming scheme.

-- 
Torbjörn