multi-threaded multiplication speed-up

Linas Vepstas linas at austin.ibm.com
Mon Jan 15 22:34:59 CET 2007


On Thu, Jan 11, 2007 at 06:01:54PM -0500, Jason Martin wrote:
> Hi All,
> 
> Out of curiosity, I recently tried a naive multi-threaded
> multiplication scheme (using pthreads) and got a speed-up of about 25%
> for extremely large operands (over 2000 limbs).  I was just using
> simple "textbook" style multiplication (break the larger operand in
> half, do two multiplies using mpn_mul, then shift/add back together).
> I got similar results on a Woodcrest running OS X and a dual core
> Opteron running Linux.

It seems to me that the extra overhead of re-assembling the result 
might be wasteful; my gut impression is that a better parallelization
scheme would be to parallelize independent operations, in the 
style of "dataflow", i.e. using an input only when it's available and
ready.

Thus, for example, suppose the user application was this:
  mpf_mul (a,b,c);
  mpf_mul (d,e,f);
  mpf_add (g,a,d);

The implementation would use "scoreboarding" to determine when
a variable is ready to be used, and to dispatch threads. Thus,
the actual code path, under the covers, would be:
  mpf_mul (a,b,c) {
    if ((b is ready) and (c is ready)) {
       mark a not ready;
       start thread to perform the actual multiplication;
       return to user;
    }
    else
    {
       launch thread that waits for b and c to be ready;
       return to user.
    }
  }

while the multiplication thread would mark variable "a" as ready 
when it was done multiplying.
    
This style of parallelizing is called "dataflow", and is tried & true
in certain problem domains.

The goal of parallelizing this way is to avoid the overhead
of re-assembling the result from pieces; being coarse-grained,
it has less overhead in general.

But I dunno; I've not ever tried this; this is just off the top 
of my head.

--linas