sqrt algorithm

Sun Aug 9 16:21:06 UTC 2015

Ciao,

On Fri, August 7, 2015 9:15 am, TorbjÃ¶rn Granlund wrote:
> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
>
>   tune/speed-cs 1-9000 -f2 mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
>   overhead 5.84 cycles, units of 2.86e-10 secs, CPU freq 3500.08 MHz
>                 mpn_sqr     mpn_sqrlo mpn_sqrmod_bnm1
>   1              #15.58         19.88         36.04
>   2              #19.48         20.54         39.94
>   4              #51.43         56.01         81.91
>   8              139.29       #114.17        174.01
>   16             453.79       #308.36        469.08

> I assume this is with assembly sqr_basecase and C sqrlo_basecase?

Yes, on the same platform (shell.gmplib) with --disable-assembly you get
              mpn_sqr     mpn_sqrlo mpn_sqrmod_bnm1
1               24.07        #21.24         43.27
2               83.43        #24.74        108.32
4              205.25       #126.17        241.68
8              598.30       #330.63        653.30
16            2010.31      #1046.87       1700.02
...
(of course this also means C mul_1, C addmul_1 ...)

> The speedup in a range is significant.  With an assembly sqrlo_basecase
the range and speedup will improve.

With an assembly sqrlo_basecase we can speedup small sizes, but it will
not impact on bigger sizes. sqrlo_dc is not recursive, it uses plain sqr
and mullo, the range will not improve. On the other side, improving
sqrmod_basecase can speed-up the whole function.

> Curious that mpn_sqrlo is slower than both functions for 8192.  And

When the size is too big, mullo and sqrlo can not be faster than mul_n and
sqr, so they have to allocate, use the plain function, copy the relevant
part, free...

> shouldn't mpn_sqrlo and mpn_sqr be within a factor 2+epsilon of
> mpn_sqrmod_bnm1?

They respectively cost FFT(2n), and FFT(n).

Regards,
m

-- 
http://bodrato.it/papers/