"balanced" unbalanced toom22

Thu Aug 7 12:54:50 CEST 2008

David,

> I tried writing another version of mpn_mul_toom22(), based on TG's  
> code from http://gmplib.org/devel/mpn/generic/mul_toom22.c.
> 
> The idea is to break up the inputs as e.g.
> 
> AAAAAA|AAAAAA
> ==BBBB|BBBB==
> 
> instead of
> 
> AAAAAA|AAAAAA
> ====BB|BBBBBB

nice idea, I already saw it somewhere...

> So for example, if the input lengths are 2n >= 2m, then instead of  
> splitting into n*n and n*n and n*(2m-n) products, it gets split into  
> n*n and n*m and n*m.
> 
> I was hoping that the "better balancing" of the subproducts would  
> make things faster.
> 
> Well, it turns out to be slower, for reasons I only partially  
> understand. TG suggested it might be useful for me to post the code  
> here anyway. (Warning: the code is a bit messy.)
> 
> I don't think it's just because the algorithm is more complicated and  
> the overhead larger.
> 
> Take for example the fairly extreme case 200x120. (This would  
> probably be handled by toom32, but just as an example.)
> 
> The existing toom22 breaks it up as 100x100 + 100x100 + 100x20. My  
> experimental code does it as 100x100 + 100x60 + 100x60.
> 
> The problem is that both 100x60's get only a little benefit from  
> karatsuba (threshold is 46 on this machine), whereas 100x100 gets  
> quite a lot, more than the two 100x60's put together. So the first  
> method wins even though 100x20 doesn't get to take advantage of  
> karatsuba at all. I suspect the same thing happens at any size.

I could not run the attached code since the timing.h file is missing.

I suggest you first try to compare the time of 100x100 + 100x100 + 100x20
wrt 100x100 + 100x60 + 100x60 using the tune/speed routine. It is easy to
modify it to tune the mpn_mul function, using the extra parameter .r to
give the 2nd size (patch below). I get for example on a Core 2:

tiramisu% ./speed -s 100 mpn_mul.100 mpn_mul.60 mpn_mul.20
          mpn_mul.100    mpn_mul.60    mpn_mul.20
100       0.000020801   0.000015202  #0.000005396

this would give 0.000046998 for toom22, and 0.000051205 for your modified code.

On an Opteron I get:

achille% ./speed -s 100 mpn_mul.100 mpn_mul.60 mpn_mul.20
          mpn_mul.100    mpn_mul.60    mpn_mul.20
100       0.000013052   0.000010390  #0.000003529

this would give 0.000029633 for toom22, and 0.000033832 for your modified code.

Thus your results are not surprising. My guess is that you are too near the
quadratic region, where the modified cutting should not win. Also, toom22
performs only one unbalanced product, whereas the modified cutting performs
two, and I believe mpn_mul currently behaves much better for balanced
multiplication.

Paul

--- speed.c.orig	2008-08-07 11:43:06.000000000 +0200
+++ speed.c	2008-08-07 11:43:54.000000000 +0200
@@ -152,6 +152,7 @@
   { "noop_wxys",         speed_noop_wxys            },
 
   { "mpn_add_n",         speed_mpn_add_n,     FLAG_R_OPTIONAL },
+  { "mpn_mul",           speed_mpn_mul,       FLAG_R_OPTIONAL },
   { "mpn_sub_n",         speed_mpn_sub_n,     FLAG_R_OPTIONAL },
 
 #if HAVE_NATIVE_mpn_addsub_n
--- speed.h.orig	2008-08-07 11:43:10.000000000 +0200
+++ speed.h	2008-08-07 11:45:16.000000000 +0200
@@ -145,6 +145,7 @@
 double speed_mpf_init_clear _PROTO ((struct speed_params *s));
 
 double speed_mpn_add_n _PROTO ((struct speed_params *s));
+double speed_mpn_mul   _PROTO ((struct speed_params *s));
 double speed_mpn_addlsh1_n _PROTO ((struct speed_params *s));
 double speed_mpn_addsub_n _PROTO ((struct speed_params *s));
 double speed_mpn_and_n _PROTO ((struct speed_params *s));
@@ -659,6 +660,40 @@
   }
 
 /* For mpn_add_n, mpn_sub_n, or similar. */
+#define SPEED_ROUTINE_MPN_MUL(call)				\
+  {									\
+    mp_ptr     wp;							\
+    mp_ptr     xp, yp;							\
+    unsigned   i;							\
+    double     t;							\
+    TMP_DECL;								\
+									\
+    SPEED_RESTRICT_COND (s->size >= 1);					\
+    SPEED_RESTRICT_COND (s->size >= s->r);				\
+									\
+    TMP_MARK;								\
+    SPEED_TMP_ALLOC_LIMBS (wp, s->size + s->r, s->align_wp);		\
+									\
+    xp = s->xp;								\
+    yp = s->yp;								\
+									\
+    speed_operand_src (s, xp, s->size);					\
+    speed_operand_src (s, yp, s->size);					\
+    speed_operand_dst (s, wp, s->size);					\
+    speed_cache_fill (s);						\
+									\
+    speed_starttime ();							\
+    i = s->reps;							\
+    do									\
+      call;								\
+    while (--i != 0);							\
+    t = speed_endtime ();						\
+									\
+    TMP_FREE;								\
+    return t;								\
+  }
+
+/* For mpn_add_n, mpn_sub_n, or similar. */
 #define SPEED_ROUTINE_MPN_ADDSUB_N_CALL(call)				\
   {									\
     mp_ptr     ap, sp;							\
--- common.c.orig	2008-08-07 11:43:18.000000000 +0200
+++ common.c	2008-08-07 11:48:54.000000000 +0200
@@ -810,6 +810,11 @@
   SPEED_ROUTINE_MPN_BINARY_N_CALL (mpn_and_n (wp, s->xp, s->yp, s->size));
 }
 double
+speed_mpn_mul (struct speed_params *s)
+{
+  SPEED_ROUTINE_MPN_MUL (mpn_mul (wp, s->xp, s->size, s->yp, s->r));
+}
+double
 speed_mpn_andn_n (struct speed_params *s)
 {
 SPEED_ROUTINE_MPN_BINARY_N_CALL (mpn_andn_n (wp, s->xp, s->yp, s->size));