tuneup.c

Jason Moxham J.L.Moxham@maths.soton.ac.uk
Thu, 6 Mar 2003 02:35:03 +0000


On Thursday 06 Mar 2003 12:37 am, Kevin Ryde wrote:
> Jason Moxham <J.L.Moxham@maths.soton.ac.uk> writes:
> > I writing a high and low part multiplication and as you are probably
> > aware the benefit at very large sizes is very small , so I have a
> > threshold to switch over to normal mpn_mul .
>
> If the lowhalf form is no slower then there's probably no need to
> switch.

I forgot to mention this , but theres another reason why I want it to swi=
tch=20
over before (say 10^6 limbs)
In the recursive "kara" form of lowpart multiplication there is another=20
parameter that needs to be tuned , so writing a "perfect" lowpart=20
multiplication function and timing it for all possible values of this=20
parameter , I got a large table of this "perfect parameter" which I fitte=
d a=20
curve to. Now back in the real world I use this curve to estimate what va=
lue=20
of this parameter I should use for each size of input. The real world low=
part=20
mul requires this parameter to satisfy certain bounds , which the curve d=
oes=20
over a large range of the input values , but not all.This is because the=20
curve I guessed is not the true curve but just a good fit over the ranges=
 I=20
tested. So to verify that the lowpart mul function will produce correct=20
results I need to check that the curve satisfys the conditions over all=20
values that will be used , so in tests/t-mullowhalf.c I just check all=20
possible values , and if I can restrict the threshold to say 10^6 limbs t=
hen=20
this tests runs <1 sec . It a bit of a hack I know , but it seems pratica=
l.=20
Note this test has to be run each time any of the different mul threshold=
s=20
changes , eg base,kara,toom,fft , as the condition and choice of paramete=
r=20
depends on all these "varibles".

One interesting observation I did make , which complicates the above , is=
 that=20
the choice of parameter , which trades multiplications of certain sizes f=
or=20
another set of multiplications with different sizes , is that the=20
multiplication algorithms (kara,toom3,fft) tend to dominate the actuall=20
choice of parameter . In effect the best parameter to choose is one that=20
matches the choice of multiplication algorithm.

eg

Assuming kara threshold =3D25 and toom3 threshold =3D177

Say we want lowpart mul of 787 limb sized number
theorectically we would split this into a mul of 580 limbs and two low pa=
rts=20
of 207 limbs , however because the implementation of multiplication is=20
important we get
580 > 177           =3D> toom3
580/3=3D193 > 177   =3D> toom3
580/9=3D64 > 25      =3D> kara
580/18=3D32 > 25    =3D> kara
580/36=3D16 > 1     =3D> basecase

So we should instead choose a multiple of 36 closest to 580  ie 576
So the fastest way to calculate the low part of a 787 limb number is to s=
plit=20
it into a 576 limb mul and two 211 limb low parts

This gives me another a few percent speedup.You can see why this is likel=
y to=20
true as the split into 576,211 is very uneven and therefore the "first" m=
ul=20
"dominates" the time  . It also matches very well what the "perfect" lowp=
art=20
multiplication what returning.

What I mean by the "perfect" lowpart multiplication is that=20
assume we have the fastest lowpart mul for all sizes <n
then to construct the faster lowpart mul for size n , we time the functio=
n for=20
all choices of parameter  and find the fastest one , store this result in=
 a=20
big array and increase n to get the fastest mul for all n (within limits!=
 )

>
> > However the tuneup program can not detect this
> > crossover very well. Is there some sort of fudge factor (eg like
> > function_fudge in struct param_t ) so that I can tell the tuneup prog=
ram
> > to switch over when mpn_mullowhalf is about within 1% of mpn_mul .
>
> Hmm.  Not really.  Maybe stop_factor is close to what you need.  If
> it's 1.0 or just above 1.0 then I think the looping will end pretty
> much as soon as method A is slower than method B.  Combine that with
> the function_fudge to stop when A is within a certain factor of B.
>

Sounds like the stuff I need

> If the main loop in "one" doesn't suit then you can always do your own
> thing with tuneup_measure.  tune_jacobi_base is a simple example of
> that, "fft" is a hairy example.
>
> > I really need
> > something like this because for even larger sizes the speeds of the t=
wo
> > functions are always about the same , so no definite crossover can be
> > detected. This is more important for the highpart multiplication than=
 the
> > lowpart , as the probability of a detectible error increases with siz=
e.
> > Note the "crossover" sizes are approx 16000 to 30000 limbs , ie well =
in
> > fft ranges
>
> Big sizes can be time consuming to probe, and I'm not sure accuracy
> and reproducibility are terrific when time periods start to get big.
> If there's not a lot of difference at the point your interested in
> then I don't suppose it needs to be located too accurately.

Yes , accuracy is not too important , just so I dont have to do it by han=
d

> _______________________________________________
> gmp-discuss mailing list
> gmp-discuss@swox.com
> https://gmplib.org/mailman/listinfo/gmp-discuss