<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Re: Factorial improvements</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2900.2722" name=GENERATOR></HEAD>
<BODY>
<DIV>>Date: Wed, 14 Sep 2005 22:19:39 +0200<BR>>From: Marsac Laurent
<laurent.marsac@esil.univ-mrs.fr><BR>><BR>>Hello,<BR>><BR>>I'm
currently trying to improve n! calculation (mpz/fac_ui.c) and i just<BR>>want
to know if anyone else is already working on it, or who is<BR>>responsible of
that file ?<BR>><BR>>As described in the TODO list, i'm doing the prime
factorization of n!,<BR>>and shiftings for factors of 2. The first results
seems to show my<BR>>algorithm improves performances and i hope i could send
more details<BR>>here ASAP.<BR>><BR>>--<BR>>Laurent Marsac<BR></DIV>
<DIV>With OpenPFGW, we wrote our own n! function. Since we use our own
</DIV>
<DIV>efficient C++ wrapper class around the raw gmp, I will not post code,
</DIV>
<DIV>since it may make reading it harder to read by members of this list,
along</DIV>
<DIV>with the fact that the function call is generic and handles several other
things</DIV>
<DIV>beyond a simple n! (such as multi-factorial and modular factorials), thus
</DIV>
<DIV>again making it more difficult to read. </DIV>
<DIV> </DIV>
<DIV>The best performance speedup's we found are due to 2 things.</DIV>
<DIV> </DIV>
<DIV>1. pack as many multiplies together in single precision mode (i.e.
multiply</DIV>
<DIV>multiple small numbers into a 32 (or 64) bit machine value, and then
when</DIV>
<DIV>that becomes full, multiply tha into the gmp number. We
pre-computed</DIV>
<DIV>a table of the number of values we could multiply together (without
overflow),</DIV>
<DIV>and where the "break over" points were, where one less number at a
time</DIV>
<DIV>must be used.</DIV>
<DIV> </DIV>
<DIV>2. split up the operation into a set of GMP accumulator numbers, then
at</DIV>
<DIV>the end multiply all the accumulators together. We chose a constant 4
</DIV>
<DIV>accumulators and seemed to work out well. All the work we did was
</DIV>
<DIV>in the mpz level. We did not investigate using the mpn level and
doing</DIV>
<DIV>multbyonelimb type calls. It is possible that a tuned mpn
version would</DIV>
<DIV>outperform the split accumulator mpz method, however, I think that </DIV>
<DIV>splitting up this iteration of multiplies into multiple accumulators is
a</DIV>
<DIV>good way to improve the performance.</DIV>
<DIV> </DIV>
<DIV>The first optimization avoids the multi-precision mults as much as
possible.</DIV>
<DIV>The second gives a Karatsuba like divide and conquer affect.</DIV>
<DIV> </DIV>
<DIV>I do not remember the exact speedup we achieved over gmp's
mpz/fac_ui.c</DIV>
<DIV>code, but if I remember right, it was a significant improvement. (orders of
</DIV>
<DIV>magnitude, and reduced the big-O cost, thus when the numbers grew in</DIV>
<DIV>size, this method did better and better). What we did in OpenPFGW
was</DIV>
<DIV>never "tuned". It was simply hacked together to get a significant
speedup,</DIV>
<DIV>and then left at that (a "good enough" improvement)</DIV>
<DIV> </DIV>
<DIV>Jim.</DIV></BODY></HTML>
<BR><BR><font <FONT face="Arial, Helvetica, sans-serif" size="2"><a href="http://www.trayhelper.com" target="_th"><img src="cid:th_emoticon_th_.gif" border="0"></a> <a href="http://www.trayhelper.com" target="_th"><FONT face="Comic Sans MS" size=2>Tray Helper</font></a> - try to find something more useful...</font>