nisse at lysator.liu.se
Thu Dec 28 08:14:31 UTC 2017
"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
> Il Ven, 6 Ottobre 2017 1:43 pm, Niels Möller ha scritto:
>> It might work to first take out powers of two, and then rewrite
>> multiplies as squares like
>> n (n-2) = (n-1)^2 - 1
>> n (n-2)(n-4)(n-6) = [(n-3)^2 - 5]^2 - 25
> Do you think it may be worth using tricks like that to reduce the number
> of hard-coded multiplications in the mul*() functions in bin_uiui ?
I think it's worth trying. Replacing multiplies by squares is no gain
for scalar numbers, but reducing the total number of multiply operations
is good. I'd expect that the multiply execution unit(s) is usually the
bottleneck for this code?
Let's look closer at one of the functions.
> static mp_limb_t
> mul6 (mp_limb_t m)
> - mp_limb_t m01 = (m + 0) * (m + 1);
> - mp_limb_t m23 = (m + 2) * (m + 3);
> - mp_limb_t m45 = (m + 4) * (m + 5) >> 1;
> - mp_limb_t m0123 = m01 * m23 >> 3;
> - return m0123 * m45;
> + mp_limb_t m05 = (m + 0) * (m + 5);
> + mp_limb_t m1234 = (m05 + 4) * (m05 + 6) >> 3;
> + return m1234 * (m05 >> 1);
If I read it correctly, the change reduces the number of multiplies from
5 to 3. And number of non-multiply operations is reduced too, from 7 to
The dependency depth is unchanged, though, since before the change, the
first three multiplies are independent and could be done in parallel,
while after the change, the multiplies are fully dependent. So *if*
there's no shortage of execution resources, execution time of both
versions should corresponding to three multiplication latencies (plus a
cycle or two for the additions on the dependency chain).
To me, it seems likely that the reduced resource usage of the changed
version will bring a real gain.
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel