sqrt algorithm

Tue Aug 11 12:47:14 UTC 2015

nisse at lysator.liu.se (Niels Möller) writes:

  > tune/speed-cs 1-9000 -f2 mpn_sqr mpn_sqrlo mpn_sqrmod_bnm1
  > overhead 5.84 cycles, units of 2.86e-10 secs, CPU freq 3500.08 MHz
  >               mpn_sqr     mpn_sqrlo mpn_sqrmod_bnm1
  > 1              #15.58         19.88         36.04
  > 2              #19.48         20.54         39.94
  > 4              #51.43         56.01         81.91
  > 8              139.29       #114.17        174.01
  > 16             453.79       #308.36        469.08
  > 32            1495.95       #974.61       1146.67
  > 64            4805.27       3421.11      #3096.75

  But for the cases where they compete, typically where the high half of
  the product is known, one should maybe add the (linear) reconstruction
  cost one will have for sqrmod_bnm1. Which is O(n), I think, but which
  might push the sqrlo range a bit wider.

This is an interesting observation.  We might not get this right for
other cutoff points today.  In some cases, I'd expect the cutoff between
functions A and B will be different depending on usage; some usages will
add a linear term to A or B.

  (And with assembly, I imagine
  mpn_sqrlo ought to be faster than mpn_sqr for the smallest sizes).

Indeed.

I took a new look at mullo now.  Our C code calls mul_1 and addmul_1
(never mul_2, addmul_2) and furthermore calls the functions with a count
argument so large that the return value is ignored.  I'd expect things
to be faster by decreasing the count and compensate by performing a
plain limb multiply in C.

-- 
Torbjörn
Please encrypt, key id 0xC8601622