middle products

Wed Jun 17 23:54:58 CEST 2009

On Jun 17, 2009, at 5:31 PM, Torbjorn Granlund wrote:

> David Harvey <dmharvey at cims.nyu.edu> writes:
>
>   I think the reason for this is that the K8 mulmid_basecase uses
>   internally an addmul_2 loop that runs at 2.375 cycles/limb, whereas
>   the division routines are only using addmul_1, which runs at 2.5
>   cycles/limb. If the division code did two rows at a time, it would
>   probably push this threshold somewhat higher (surely at least to the
>   mulmid karatsuba threshold)
>
> Maybe, maybe not.

Actually, what I said before was nonsense. There is no direct  
comparison to be made.

Suppose that n * n multiplication costs n^2. (I ignore linear terms  
in this email).

Then schoolbook 2n * n division with remainder also costs n^2.

Now what about middle-product-based reciprocal, assuming one  
recursion layer before switching to mpn_tdiv_qr?

It should be:

-- one schoolbook n * n/2 division = n^2/4
-- one n * n/2 middle product = n^2/4
-- one n/2 * n/2 product = n^2/4

so the total is only (3/4)*n^2, less than the n^2 incurred by  
mpn_tdiv_qr. So maybe this is the reason it wins so early, regardless  
of the 2.375 vs 2.5 issue.

Now here is a best case scenario. Suppose the reciprocal is done  
using quotient-only-division. Then it's reasonable to assume the cost  
is n^2/2. The middle-product-based version will incur:

-- one schoolbook n * n/2 quotient-only-division = n^2/8
-- one n * n/2 middle product = n^2/4
-- one n/2 * n/2 truncated product (mulhi) = n^2/8

and of course the total is still n^2/4. It's pretty much doing the  
same products in a different order.

I guess my point is: if everything was implemented "properly", then  
the middle-product-based algorithm shouldn't be helpful until much  
larger n; the only reason it should help is that it is more efficient  
than truncated products for larger n.

david