middle products

Wed Jun 17 23:31:12 CEST 2009

David Harvey <dmharvey at cims.nyu.edu> writes:

  I think the reason for this is that the K8 mulmid_basecase uses
  internally an addmul_2 loop that runs at 2.375 cycles/limb, whereas
  the division routines are only using addmul_1, which runs at 2.5
  cycles/limb. If the division code did two rows at a time, it would
  probably push this threshold somewhat higher (surely at least to the
  mulmid karatsuba threshold)

Maybe, maybe not.

The schoolbook division loop consists of two steps.

1. Quotient digit approximation
2. Approximate quotient application (partial remainder - divisor * quotient)

The (O(n^2) for all invocations) 2nd step could be sped up with addmul_2
(or more straightforwardly with submul_2).  But the (O(n) for all
invocations) quotient approximation will take more time, and computation
of a divisor inverse (O(1)) will need more time since a wider inverse
will be required.

I've always suspected schoolbook division could be beaten with code
based on short products.

-- 
Torbjörn