Pentium 4 SSE2 challenge

Torbjorn Granlund tg at swox.com
Fri Mar 18 02:51:16 CET 2005


GMP doesn't run as fast as it could on Pentium 4.  The main
problem is that mpn_addmul_1 runs relatively slowly, at 5.5 or
6.0 cycles/limb (depending on P4 model).

It seems possible to shave off a cycle from these numbers,
reaching 4.5 and 5 cycles/limb, respectively.  But that's still
not great.  We should go for mpn_addmul_2 instead as the main
multiplication function.

mpn_addmul_2 is defined like this:

  mp_limb_t
  mpn_addmul_2 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_srcptr vp)
  {
    rp[n] = mpn_addmul_1 (rp, up, n, vp[0]);
    return mpn_addmul_1 (rp + 1, up, n, vp[1]);
  }
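
As a concrete illustration of that definition, here is a minimal
portable C sketch using 32-bit limbs.  The names limb_t,
ref_addmul_1 and ref_addmul_2 are stand-ins invented for this
example, not GMP's own code.  Note that the second pass works on
rp + 1, since vp[1] carries one limb more weight than vp[0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* {rp,n} += {up,n} * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* At most (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so no overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* {rp,n} += {up,n} * {vp,2}.  Limb n of the result is stored in
   rp[n] (its old value is overwritten); limb n+1 is returned. */
limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                    const limb_t *vp)
{
    rp[n] = ref_addmul_1(rp, up, n, vp[0]);
    return ref_addmul_1(rp + 1, up, n, vp[1]);
}
```

With n = 1, rp = {10, junk}, up = {3} and vp = {5, 7}, this leaves
rp = {25, 21} and returns 0, i.e. 10 + 3*(5 + 7*2^32).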

Are there any Pentium 4 assembly hackers out there?  Do you have
some spare time to play with this over Easter?

Below is what I came up with.  It runs at 7.5 cycles/limb, which
is of course slower than mpn_addmul_1.  It surely could be
improved to the point where it vastly outperforms mpn_addmul_1.

That is my challenge to the list: Write an mpn_addmul_2 that runs
at 4 cycles/limb or better.  I suspect 3 cycles/limb might be
possible.  Your tools: (1) An MMX/SSE2 assembly manual (from
www.amd.com or www.intel.com), (2) Loop unrolling, (3) Software
pipelining.
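
To make the target concrete, the following C sketch (not the
attached assembly, and not GMP code; all names are hypothetical)
does the work of mpn_addmul_2 in a single pass, so that each up[i]
and rp[i] is loaded once instead of twice.  It only models the
data flow with 32-bit limbs and 64-bit products; an SSE2 version
would still have to be unrolled and software pipelined by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* hypothetical stand-in for mp_limb_t */

/* Single pass: {rp,n} += {up,n} * {vp,2}.  Limb n of the result is
   stored in rp[n] (overwriting it); limb n+1 is returned. */
limb_t fused_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                      const limb_t *vp)
{
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0;              /* pending carry into limb i     */
    limb_t c1 = 0;              /* pending carry into limb i + 1 */

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];
        /* Each sum is at most (2^32-1)^2 + 2*(2^32-1) = 2^64-1. */
        uint64_t t0 = (uint64_t)u * v0 + rp[i] + c0;
        uint64_t t1 = (uint64_t)u * v1 + c1 + (limb_t)(t0 >> 32);
        rp[i] = (limb_t)t0;
        c0 = (limb_t)t1;
        c1 = (limb_t)(t1 >> 32);
    }
    rp[n] = c0;
    return c1;
}

/* addmul_1-style loop: {rp,n} += {up,n} * v, returns the carry. */
limb_t addmul_1_c(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Two-pass version, kept only to cross-check the fused loop. */
limb_t twopass_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                        const limb_t *vp)
{
    rp[n] = addmul_1_c(rp, up, n, vp[0]);
    return addmul_1_c(rp + 1, up, n, vp[1]);
}
```

Both functions should produce identical limbs and return values
for any input; the fused loop simply halves the loads and stores
per limb product, which is where an assembly version has room to
gain.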

The reward is that lots of people with these roaster chips will
become grateful to you.  :-)

If there are any takers, I will include the result in GMP 4.2.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: p4a-addmul_2.asm
Type: application/octet-stream
Size: 2828 bytes
Desc: not available
Url : http://gmplib.org/list-archives/gmp-devel/attachments/20050318/414789ed/p4a-addmul_2.obj
-------------- next part --------------

Test code.  Compile with -DN=2 to test mpn_addmul_2, pass the CPU
clock frequency in Hertz as -DCLOCK=XXX, and pass the operand size
as, e.g., -DSIZE=50.  Sample compilation command line (assuming
the current directory is the top of a gmp-4.1.4 source tree in
which a fresh build has been made, and that the two attached
files are there too):

(cd mpn; m4 ../addmul_2.asm) >x.s && gcc -O -I. x.s addmul_N.c -DN=2 -DCLOCK=XXX tests/.libs/libtests.a .libs/libgmp.a -DSIZE=50 && ./a.out 2
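
The attached addmul_N.c is not reproduced here, so the following
is only an assumed sketch of how a clock frequency (as passed via
-DCLOCK) and an operand size (as passed via -DSIZE) can turn a
measured run time into a cycles/limb figure.  The function name
and its parameters are hypothetical, and how many limbs to count
per call for mpn_addmul_2 is a convention left to the caller.

```c
#include <assert.h>

/* cycles/limb = elapsed seconds * clock rate / limbs processed.
   Hypothetical helper; not taken from the attached test program. */
double cycles_per_limb(double seconds, double clock_hz,
                       long calls, long limbs_per_call)
{
    assert(calls > 0 && limbs_per_call > 0);
    return seconds * clock_hz
           / ((double)calls * (double)limbs_per_call);
}
```

For example, 10,000,000 calls on 50-limb operands taking 1.0 s on
a 3 GHz machine works out to 6.0 cycles/limb.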

-------------- next part --------------
A non-text attachment was scrubbed...
Name: addmul_N.c
Type: application/octet-stream
Size: 6077 bytes
Desc: not available
Url : http://gmplib.org/list-archives/gmp-devel/attachments/20050318/414789ed/addmul_N.obj
-------------- next part --------------

--
Torbjörn

