sqrt algorithm

Wed Aug 12 12:03:40 UTC 2015

"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:

  We have an explicit example of this: INV_MULMOD_BNM1_THRESHOLD is
  typically larger than the MULMOD_BNM1_THRESHOLD, the latter is only used
  internally .

OK.  These are widely apart, the quotient between them is 3 on average.
MULMOD_BNM1_THRESHOLD varies from 10 to 20, INV_MULMOD_BNM1_THRESHOLD
varies from 20 to 80.

  I tested this approach for sqrlo_basecase too, you can find the code
  enclosed by
  #ifdef SQRLO_SHORTCUT_MULTIPLICATIONS

  But I'm not sure it is faster, so it is currently disabled.

It will obviously be faster for machines where widening multiplication
is expensive, and in some other hardware cases.

How it will work for high-end CPUs, I cannot tell.  I would expect a
small speedup in most cases.  But using mul_2/addmul_2 when these exists
is more important.

There is a weird reason for why addmul_2 will pay off extra here: Loop
counter prediction problems for some CPUs.  The inner loop counts will
decrease every time the inner loop is repeated.  CPUs which base
predicition on branch count repetions will pay a pipeline replay for the
last inner loop back branch and in many cases again for the non-taken
inner loop fall-though.

Why does that hurt addmul_1 more than addmul_2?  Simply because the
latter will be executed twice as many times.

(I believe recent Intel CPUs detect the loop count register and predict
based in its value rather than on branch insn statistics/counts.)

I don't recall how we defined mullo/sqrlo wrt the size of the target
operand.  Can we write above the defined result, i.e., can we write the
full 2n product if it is convenient?

-- 
Torbjörn
Please encrypt, key id 0xC8601622