<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD><TITLE>Re: Factorial improvements</TITLE> <META http-equiv=Content-Type content="text/html; charset=iso-8859-1"> <META content="MSHTML 6.00.2900.2722" name=GENERATOR></HEAD> <BODY> <DIV>>Date: Wed, 14 Sep 2005 22:19:39 +0200<BR>>From: Marsac Laurent <laurent.marsac@esil.univ-mrs.fr><BR>><BR>>Hello,<BR>><BR>>I'm currently trying to improve n! calculation (mpz/fac_ui.c) and i just<BR>>want to know if anyone else is already working on it, or who is<BR>>responsible of that file ?<BR>><BR>>As described in the TODO list, i'm doing the prime factorization of n!,<BR>>and shiftings for factors of 2. The first results seems to show my<BR>>algorithm improves performances and i hope i could send more details<BR>>here ASAP.<BR>><BR>>--<BR>>Laurent Marsac<BR></DIV> <DIV>With OpenPFGW, we wrote our own n! function.  Since we use our own </DIV> <DIV>efficient C++ wrapper class around the raw gmp, I will not post code,  </DIV> <DIV>since it may make reading it harder to read by members of this list, along</DIV> <DIV>with the fact that the function call is generic and handles several other things</DIV> <DIV>beyond a simple n! (such as multi-factorial and modular factorials), thus </DIV> <DIV>again making it more difficult to read. </DIV> <DIV> </DIV> <DIV>The best performance speedup's we found are due to 2 things.</DIV> <DIV> </DIV> <DIV>1.  pack as many multiplies together in single precision mode (i.e. multiply</DIV> <DIV>multiple small numbers into a 32 (or 64) bit machine value, and then when</DIV> <DIV>that becomes full, multiply tha into the gmp number.  We pre-computed</DIV> <DIV>a table of the number of values we could multiply together (without overflow),</DIV> <DIV>and where the "break over" points were, where one less number at a time</DIV> <DIV>must be used.</DIV> <DIV> </DIV> <DIV>2. split up the operation into a set of GMP accumulator numbers, then at</DIV> <DIV>the end multiply all the accumulators together.  We chose a constant 4 </DIV> <DIV>accumulators and seemed to work out well.  All the work we did was </DIV> <DIV>in the mpz level.  We did not investigate using the mpn level and doing</DIV> <DIV>multbyonelimb type calls.  It is possible that a tuned mpn version would</DIV> <DIV>outperform the split accumulator mpz method, however, I think that </DIV> <DIV>splitting up this iteration of multiplies into multiple accumulators is a</DIV> <DIV>good way to improve the performance.</DIV> <DIV> </DIV> <DIV>The first optimization avoids the multi-precision mults as much as possible.</DIV> <DIV>The second gives a Karatsuba like divide and conquer affect.</DIV> <DIV> </DIV> <DIV>I do not remember the exact speedup we achieved over gmp's mpz/fac_ui.c</DIV> <DIV>code, but if I remember right, it was a significant improvement. (orders of </DIV> <DIV>magnitude, and reduced the big-O cost, thus when the numbers grew in</DIV> <DIV>size, this method did better and better).  What we did in OpenPFGW was</DIV> <DIV>never "tuned".  It was simply hacked together to get a significant speedup,</DIV> <DIV>and then left at that (a "good enough" improvement)</DIV> <DIV> </DIV> <DIV>Jim.</DIV></BODY></HTML> <BR><BR><font <FONT face="Arial, Helvetica, sans-serif" size="2"><a href="http://www.trayhelper.com" target="_th"><img src="cid:th_emoticon_th_.gif" border="0"></a>  <a href="http://www.trayhelper.com" target="_th"><FONT face="Comic Sans MS" size=2>Tray Helper</font></a> - try to find something more useful...</font>