udiv_qr_3by2 vs divappr

Thu Sep 20 08:55:54 UTC 2018

       Dear Torbjörn,

this looks quite interesting, I'm looking forward for cycle numbers.
(I guess you target the case where the divisor is not invariant, otherwise
you might do some preprocessing to speed up the partial quotient
approximation.)

Paul

> From: tg at gmplib.org (Torbjörn Granlund)
> Date: Wed, 19 Sep 2018 18:18:29 +0200
> 
> This is a sketch of an idea which I have had for a long time, which we
> actually might have discussed here before.  The idea is to cut a large
> O(nn-dn) time off of schoolbook division.
> 
> Normally, the time taken by schoolbook division is latency(mpn_submul_1)
> + 2*latency(mul) + C.  Here C is the number of plain operations of
> quotient limb approximation and mul means a scalar multiply used by the
> same.
> 
> Couldn't we make the quotient limb approximation in parallel with the
> O(n^2) work?  I think we can, but it requires that we run the submul_1
> partially backwards, and thereby produce the most significant partial
> remainder limbs early.
> 
> But that's impossible, right?  The quotient limb approximation is, eh,
> approximate after all!  We cannot know the most significant partial
> remainder limbs as early as required!  No, we cannot know them for sure,
> but we can almost always tell.  And when we're wrong, we need to add
> back and recompute.  Below is a very rough implementation.  (It surely is
> not correct wrt indexes.)
> 
> The hope is that this could run almost as as fast as mul_basecase.
> There, and in this proposed algorithm, subsequent addmul_1/submul_1 can
> actually overlap with the help of OoO execution.  So hopefully on good
> OoO pipelines this could not just hide the division limb approximation
> latency, but actually hide much of the feed-in and wind-down latency of
> submul_1!
> 
> qnext = divappr (np[1], np[0], d1, d0);
> for (...)
>   {
>     // Apply q to most significant partial remainder limbs
>     cy = submul_1 (np - 2, dp + dn - 2, qnext, 2);
> 
>     q = qnext;
>     // Now, with the most significant few limbs of the next partial remainder
>     // computed to some accuracy, compute q for the next iteration.  This might
>     // overestimate q worse than with the old algorithms.
>     qnext = divappr (np[1], np[0], d1, d0);
> 
>     // Apply q to all but the most significant partial remainder limbs.
>     cy = submul_1 (np - dn, dp, q, dn - 2);
> 
>     if (UNLIKELY (cy > np[0]))
>       {
>         np[0] -= cy;
> 	// q was too large, qnext is garbage so recompute it!
> 	// (We could perhaps recompute it more cleverly.)
> 	ASSERT_CARRY(mpn_add_n (np - dn, np - dn, dp, dn));
> 	qnext = divappr (np[1], np[0], d1, d0);
>       }
>     else
>       np[0] -= cy;
>     np--;
>   }
> 
> -- 
> Torbjörn
> Please encrypt, key id 0xC8601622
> _______________________________________________
> gmp-devel mailing list
> gmp-devel at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-devel