multi-threaded multiplication speed-up

Pierrick Gaudry gaudry at lix.polytechnique.fr
Tue Jan 16 20:44:07 CET 2007


On Mon, Jan 15, 2007 at 03:34:59PM -0600, Linas Vepstas wrote:
> On Thu, Jan 11, 2007 at 06:01:54PM -0500, Jason Martin wrote:
> > Hi All,
> > 
> > Out of curiosity, I recently tried a naive multi-threaded
> > multiplication scheme (using pthreads) and got a speed-up of about 25%
> > for extremely large operands (over 2000 limbs).  I was just using
> > simple "textbook" style multiplication (break the larger operand in
> > half, do two multiplies using mpn_mul, then shift/add back together).
> > I got similar results on a Woodcrest running OS X and a dual core
> > Opteron running Linux.
> 
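A minimal sketch of this split-in-half scheme might look like the
following. It assumes un/2 >= vn, so that each half of u is at least as
long as v (mpn_mul requires its first operand to be the longer one),
and that rp does not overlap the inputs; the names half_arg,
half_worker and mul_split2 are mine, not GMP's.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <gmp.h>

    /* Argument block for one partial product. */
    struct half_arg {
        mp_ptr    rp;   /* destination for this partial product */
        mp_srcptr up;   /* slice of the larger operand */
        mp_size_t un;   /* length of that slice, in limbs */
        mp_srcptr vp;   /* the whole smaller operand */
        mp_size_t vn;
    };

    static void *half_worker(void *p)
    {
        struct half_arg *a = p;
        mpn_mul(a->rp, a->up, a->un, a->vp, a->vn);
        return NULL;
    }

    /* rp[0 .. un+vn-1] = u * v, splitting u = u1*B^k + u0 at
       k = un/2 and computing the two partial products in parallel. */
    static void mul_split2(mp_ptr rp, mp_srcptr up, mp_size_t un,
                           mp_srcptr vp, mp_size_t vn)
    {
        mp_size_t k = un / 2;
        mp_ptr lo = malloc((k + vn) * sizeof(mp_limb_t));
        pthread_t t;

        /* High product u1*v lands at limb offset k of the result. */
        struct half_arg hi = { rp + k, up + k, un - k, vp, vn };
        pthread_create(&t, NULL, half_worker, &hi);

        /* Low product u0*v, computed meanwhile in the calling thread. */
        mpn_mul(lo, up, k, vp, vn);
        pthread_join(t, NULL);

        /* Recombine: the low k limbs are final; the next vn limbs
           overlap the high product and are added in.  No carry can
           leave the top, since the full product fits in un+vn limbs. */
        memcpy(rp, lo, k * sizeof(mp_limb_t));
        mpn_add(rp + k, rp + k, un - k + vn, lo + k, vn);
        free(lo);
    }
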
> It seems to me that the extra overhead of re-assembling the result 
> might be wasteful; my gut impression is that a better parallelization
> scheme would be to parallelize independent operations, in the 
> style of "dataflow", of using an input only when its available and
> ready.
> 
> Thus, for example, suppose the user application was this:
>   mpf_mul (a,b,c);
>   mpf_mul (d,e,f);
>   mpf_add (g,a,d);
> 
> The implementation would use "scoreboarding" to determine when
> a variable is ready to be used, and to dispatch threads. Thus,
> the actual code path, under the covers, would be:
>   mpf_mul (a,b,c) {
>     if ((b is ready) and (c is ready)) {
>        mark a not ready;
>        start a thread to perform the actual multiplication;
>        return to user;
>     }
>     else
>     {
>        launch a thread that waits for b and c to be ready;
>        return to user.
>     }
>   }
> 
> while the multiplication thread would mark variable "a" as ready 
> when it was done multiplying.
>     
> This style of parallelizing is called "dataflow", and is tried & true
> in certain problem domains.
> 
> The goal of parallelizing this way is that it avoids the overhead
> of re-assembling the result from pieces and, being coarse-grained,
> has less overhead in general.
> 
> But I dunno; I've not ever tried this; this is just off the top 
> of my head.
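
To make the scoreboarding idea above concrete, here is a toy sketch in
C with pthreads: each variable carries a ready flag guarded by a mutex
and condition variable, and the non-blocking multiply simply hands the
waiting to the worker thread (which folds the two branches of the
pseudocode into one). All the names here (sb_var, sb_mul, ...) are
hypothetical, not GMP API, and initialization/cleanup of the mpf_t,
mutex and condition variable is omitted.

    #include <pthread.h>
    #include <stdlib.h>
    #include <gmp.h>

    /* A "scoreboarded" variable: a value plus a ready flag. */
    typedef struct {
        mpf_t           v;
        int             ready;
        pthread_mutex_t mu;
        pthread_cond_t  cv;
    } sb_var;

    static void sb_wait_ready(sb_var *x)
    {
        pthread_mutex_lock(&x->mu);
        while (!x->ready)
            pthread_cond_wait(&x->cv, &x->mu);
        pthread_mutex_unlock(&x->mu);
    }

    static void sb_set_ready(sb_var *x, int r)
    {
        pthread_mutex_lock(&x->mu);
        x->ready = r;
        pthread_cond_broadcast(&x->cv);
        pthread_mutex_unlock(&x->mu);
    }

    struct mul_job { sb_var *a, *b, *c; };

    static void *mul_worker(void *p)
    {
        struct mul_job *j = p;
        sb_wait_ready(j->b);      /* dataflow: block until inputs exist */
        sb_wait_ready(j->c);
        mpf_mul(j->a->v, j->b->v, j->c->v);
        sb_set_ready(j->a, 1);    /* the result becomes available */
        free(j);
        return NULL;
    }

    /* Non-blocking multiply: mark a "not ready", fire a thread, return. */
    static void sb_mul(sb_var *a, sb_var *b, sb_var *c)
    {
        struct mul_job *j = malloc(sizeof *j);
        pthread_t t;
        j->a = a; j->b = b; j->c = c;
        sb_set_ready(a, 0);
        pthread_create(&t, NULL, mul_worker, j);
        pthread_detach(t);
    }

With an analogous sb_add, the three-line example becomes
sb_mul(&a, &b, &c); sb_mul(&d, &e, &f); sb_add(&g, &a, &d); and the two
multiplications overlap automatically while the addition waits on both.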

This kind of parallelization at the application level does indeed look
vastly superior to parallelizing inside the library. If an algorithm
allows it (and this is often the case), it is the right thing to do. On
the other hand, there are also algorithms in which the multiprecision
operations are inherently sequential; computing digits of Pi is one
example. For those algorithms, a multithreaded GMP would probably help.
Another application I can see is computer algebra systems like gp,
Maple or Magma. In interactive mode, the user commonly asks for the
result of a single (large) operation; since a user is sitting there
waiting, using all the available cores would be a good idea.

Pierrick.


