# middle products

David Harvey dmharvey at cims.nyu.edu
Wed Jun 17 23:54:58 CEST 2009

```On Jun 17, 2009, at 5:31 PM, Torbjorn Granlund wrote:

> David Harvey <dmharvey at cims.nyu.edu> writes:
>
>   I think the reason for this is that the K8 mulmid_basecase uses
>   internally an addmul_2 loop that runs at 2.375 cycles/limb, whereas
>   the division routines are only using addmul_1, which runs at 2.5
>   cycles/limb. If the division code did two rows at a time, it would
>   probably push this threshold somewhat higher (surely at least to the
>   mulmid karatsuba threshold)
>
> Maybe, maybe not.

Actually, what I said before was nonsense. There is no direct

Suppose that n * n multiplication costs n^2. (I ignore linear terms
in this email).

Then schoolbook 2n * n division with remainder also costs n^2.

Now what about middle-product-based reciprocal, assuming one
recursion layer before switching to mpn_tdiv_qr?

It should be:

-- one schoolbook n * n/2 division = n^2/4
-- one n * n/2 middle product = n^2/4
-- one n/2 * n/2 product = n^2/4

so the total is only (3/4)*n^2, less than the n^2 incurred by
mpn_tdiv_qr. So maybe this is the reason it wins so early, regardless
of the 2.375 vs 2.5 issue.

Now here is a best case scenario. Suppose the reciprocal is done
using quotient-only-division. Then it's reasonable to assume the cost
is n^2/2. The middle-product-based version will incur:

-- one schoolbook n * n/2 quotient-only-division = n^2/8
-- one n * n/2 middle product = n^2/4
-- one n/2 * n/2 truncated product (mulhi) = n^2/8

and of course the total is still n^2/4. It's pretty much doing the
same products in a different order.

I guess my point is: if everything was implemented "properly", then
the middle-product-based algorithm shouldn't be helpful until much
larger n; the only reason it should help is that it is more efficient
than truncated products for larger n.

david

```