tg at gmplib.org
Tue Aug 11 12:47:14 UTC 2015
nisse at lysator.liu.se (Niels Möller) writes:
> tune/speed-cs 1-9000 -f2 mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
> overhead 5.84 cycles, units of 2.86e-10 secs, CPU freq 3500.08 MHz
> mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
> 1 #15.58 19.88 36.04
> 2 #19.48 20.54 39.94
> 4 #51.43 56.01 81.91
> 8 139.29 #114.17 174.01
> 16 453.79 #308.36 469.08
> 32 1495.95 #974.61 1146.67
> 64 4805.27 3421.11 #3096.75
But for the cases where they compete, typically where the high half of
the product is known, one should maybe add the (linear) reconstruction
cost one will have for sqrmod_bnm1. Which is O(n), I think, but which
might push the sqrlo range a bit wider.
This is an interesting observation. We might not get this right for
other cutoff points today. In some cases, I'd expect the cutoff between
functions A and B will be different depending on usage; some usages will
add a linear term to A or B.
(And with assembly, I imagine
mpn_sqrlo ought to be faster than mpn_sqr for the smallest sizes).
I took a new look at mullo now. Our C code calls mul_1 and addmul_1
(never mul_2, addmul_2) and furthermore calls the functions with a count
argument so large that the return value is ignored. I'd expect things
to be faster by decreasing the count and compensate by performing a
plain limb multiply in C.
Please encrypt, key id 0xC8601622
More information about the gmp-devel