Improved binvert algorithm
marco.bodrato at tutanota.com
Thu Mar 19 23:47:08 CET 2026
Ciao,
9 Feb 2026, 20:58 from sparks05 at proton.me:
> Niels Möller <nisse at lysator.liu.se> wrote:
>
>> Interesting! To me, the clever thing is that the sequence of y values is
>> computed independently of r, with only a single multiply latency. And
>> the series of r-values is done in parallel, also with only a single
>> multiply latency per step. And then the latency of the 1+y operation is
>> only on the y --> r connection, so it is only paid once.
>>
> The thing stopping me from preparing a patch is I don't know of
> a "has pipelined multiplier" symbol to make it conditional on.
> Any suggestions?
>
> For maximum speed, we want the squaring of y to issue *before* the
> inv *= 1+y multiply. If the object code doesn't occur in that order,
> maybe tweaking the source to say "tmp = 1+y; y *= y; inv *= tmp;"
> will induce it to, and shave off another cycle.
>
When we write assembly, scheduling is our job; when we write in C, it's
the compiler's job, isn't it?
Moreover, if I remember correctly, ARM processors have a
multiply-accumulate instruction that can compute A * (B + 1) without
needing to add 1 to B first (one computes A + B * A instead).
Then, I'd declare as many y variables as needed and let the optimizer
decide. Something like:
#define binvert_limb(inv,n) \
do { \
mp_limb_t __n = (n); \
mp_limb_t __inv = CNST_LIMB (3) * __n ^ CNST_LIMB (2); \
ASSERT ((__n & 1) == 1); \
\
mp_limb_t __y0 = CNST_LIMB (1) - __inv * __n; \
mp_limb_t __y1 = __y0 * __y0; \
mp_limb_t __y2 = __y1 * __y1; \
__inv = __inv * (1 + __y0) * (1 + __y1) * (1 + __y2); \
\
if (GMP_NUMB_BITS > 40) \
{ \
int __invbits = 40; \
      mp_limb_t __y = __y2;					\
do { \
__y *= __y; \
__inv *= __y + 1; \
__invbits *= 2; \
} while (__invbits < GMP_NUMB_BITS); \
} \
\
ASSERT ((__inv * __n & GMP_NUMB_MASK) == 1); \
(inv) = __inv & GMP_NUMB_MASK; \
} while (0)
Bye, mb