sqrt algorithm
Marco Bodrato
bodrato at mail.dm.unipi.it
Sun Aug 9 16:21:06 UTC 2015
Ciao,
On Fri, August 7, 2015 9:15 am, Torbjörn Granlund wrote:
> "Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
>
> tune/speed-cs 1-9000 -f2 mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
> overhead 5.84 cycles, units of 2.86e-10 secs, CPU freq 3500.08 MHz
> mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
> 1 #15.58 19.88 36.04
> 2 #19.48 20.54 39.94
> 4 #51.43 56.01 81.91
> 8 139.29 #114.17 174.01
> 16 453.79 #308.36 469.08
> I assume this is with assembly sqr_basecase and C sqrlo_basecase?
Yes, on the same platform (shell.gmplib) with --disable-assembly you get
mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
1 24.07 #21.24 43.27
2 83.43 #24.74 108.32
4 205.25 #126.17 241.68
8 598.30 #330.63 653.30
16 2010.31 #1046.87 1700.02
...
(of course this also means C mul_1, C addmul_1 ...)
> The speedup in a range is significant. With an assembly sqrlo_basecase
the range and speedup will improve.
With an assembly sqrlo_basecase we can speedup small sizes, but it will
not impact on bigger sizes. sqrlo_dc is not recursive, it uses plain sqr
and mullo, the range will not improve. On the other side, improving
sqrmod_basecase can speed-up the whole function.
> Curious that mpn_sqrlo is slower than both functions for 8192. And
When the size is too big, mullo and sqrlo can not be faster than mul_n and
sqr, so they have to allocate, use the plain function, copy the relevant
part, free...
> shouldn't mpn_sqrlo and mpn_sqr be within a factor 2+epsilon of
> mpn_sqrmod_bnm1?
They respectively cost FFT(2n), and FFT(n).
Regards,
m
--
http://bodrato.it/papers/
More information about the gmp-devel
mailing list