# Is there a platform optimized version of gmp for INTEL?

Michael comtech.usa at gmail.com
Tue Jan 2 08:10:04 CET 2007

```Dear Decio,

Indeed, I am dealing with addition/subtraction/multiplication/division and
exp/log with extremely large numbers or numbers for which huge precision
needs to be preserved using GMP/MPFR. I need to identify which portion of
the code in GMP/MPFR, if I rewrite using Assembly language and focusing on
optimization, will lead me to the most performance gain. I really need a
TREMENDOUS speed up.

You questioned that why I dodged optimization on the algorithm level. Here
is the reason. I am dealing just one expression(complicated), but with one
expression, my task is to implement it and to squeeze as much performance as
possible. I don't see how I can optimize from the algorithm level, other
than the vectorization which I've already used in Matlab.

The whole expression is kind of straighforward, so I provide it below, using
Matlab to make it easy to demonstrate. I attach my program below. As you can
see, it involves addition/subtraction/multiplication/division and exp/log
with extremely large numbers or numbers for which huge precision needs to be
preserved. The range of k and n in the program can be hundreds. That's to
say we are dealing with numbers of the order of gamma(hundreds), due to the
binomial numbers used. I've used Matlab's symbolic computation. All the
variables, except n, are not integers or rationals.

I've used Matlab's vectorized computation to achieve faster speed. But
still, it's slow, because of the symbolic computations. However, we have to
maintain super high precisions -- we did a test, if we change some variables
from symbolic to double floating point, then we are in huge trouble, the
result is way off. For this reason, the variables we send in to this
function are all in symbolic form. The large binomial coefficients are
obtained using Maple's binomial function.

I have converted this program into GMP/MPFR and I've found some performance
gain. But still the speed is far from our demand. There two troublesome
issues: (1) we have to manually set the number of precision bits by ourself
for each set of parameters. As you can see there are many parameters
involved. Their different combination requires different number of precision
bits in order to get the correct results. If we are doing conservative, we
set the number of precision bits to be 10240 to guarantee the correct
results for "almost all" parameter sets. Then it is extremely slow. (2) We
have to optimize the code and make it faster by a speedup of hundreds to
thousands times.

Based on my description, could you give me some more detailed pointers? So I
can start getting into solve the problem?

Thanks a lot and happy new year!!!

-------------------------
function p=MyIntegrate(t, s, c, de, n, v_t, ka, mu, si);
R_t=c/de;
for k=0:n;
pr(k+1)=maple('binomial', R_t+k-1, k);
end;

for i=1:length(s)
la=MyL(de*([0:n]+R_t), t, s(i), v_t, ka, mu, si);
for k=0:n
y=alter(k,la);
p(k+1)=pr(k+1)*y;
end
end;

return;

function res=alter(k,a)
for m=0:k
s(m+1)=maple('binomial', sym(k), sym(m));
end
res=s*((-1).^[0:k].*a(1:k+1))';
return;

function L=MyL(u,t,s,v_t,ka,mu,si)
ga=0.5*sqrt(ka^2+2*u*si^2);
temp1=ga*(s-t);
temp2=sinh(temp1);
den=ga.*cosh(temp1)+0.5*ka*temp2;
be=-u.*temp2./den;
al=2*ka*mu/si^2*(0.5*ka*(s-t)+log(ga./den));
L=exp(al+be*v_t);
return;

On 1/1/07, Décio Luiz Gazzoni Filho < decio at decpp.net> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> On Jan 1, 2007, at 8:13 PM, Michael wrote:
>
> > Does anybody have a statistic about how much performance is gained
> > by using Intel Compiler and Intel tools, than MSVS 2005?
>
> Maybe 2 or 3%. If you got 5% I'd be absolutely astonished.
>
> >
> > Also if I pay big bucks and hire some experts to rewrite some
> > portions of the GMP and MPFR (only a small portion that is related
> > to my project, to make the whole thing manageable) in a Intel
> > platform optimized Assembly language. Will that improve speed a lot?
> >
>
> Depends on what part you need to be faster. If your bottleneck is the
> multiplication with huge arguments (say hundreds of millions or
> larger), then hiring someone to write a faster FFT (or an NTT, at
> that size) would bring about some large gains. If your bottleneck is
> the division code, or some of the other code that could be using
> Newton methods along with subquadratic multiplication, then the
> improvements could be huge. But unless you specify what specific
> operation is your bottleneck, then it's impossible to give an answer.
> Different parts of GMP are at different levels of optimization.
>
> Now one thing has been asked, but you dodged the question: have you
> ruled out all possibilities of algorithmic optimization? Low-level
> optimization can only get you so far, and usually not as far as
> algorithmic optimization will. You could write the best grammar-
> school multiply code possible, but in the millions of digits range it
> would take a severe beating from even the most simple-minded and
> badly written FFT possible. And even if you're done with algorithmic
> optimization, have you profiled your code to find the true
> bottlenecks and where to focus your efforts on?
>
> If at all possible, please shed some light on what you're trying to
> do. You appear biased towards low-level optimization, but others
> could help point out better strategies, if they only knew what it is
> you're trying to accomplish.
>
> Décio
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.5 (Darwin)
>
> iD8DBQFFma+g9zcAVrR+ETURAs0zAJ46+m/YAHHRrk9michzSFlLzxL5OwCeIVTU
> SUVdvXiDUJzGT3RXaNAxQL0=
> =tmyf
> -----END PGP SIGNATURE-----
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://gmplib.org/list-archives/gmp-discuss/attachments/20070101/19824ba2/attachment-0001.html
```