Improved binvert algorithm
marco.bodrato at tutanota.com
Thu Mar 19 23:47:08 CET 2026
Ciao,
9 Feb 2026, 20:58 from sparks05 at proton.me:
> Niels Möller <nisse at lysator.liu.se> wrote:
>
>> Interesting! To me, the clever thing is that the sequence of y values is
>> computed independently of r, with only a single multiply latency. And
>> the series of r-values is done in parallel, also with only a single
>> multiply latency per step. And then the latency of the 1+y operation is
>> only on the y --> r connection, so it is only paid once.
>>
> The thing stopping me from preparing a patch is I don't know of
> a "has pipelined multiplier" symbol to make it conditional on.
> Any suggestions?
>
> For maximum speed, we want the squaring of y to issue *before* the
> inv *= 1+y multiply. If the object code doesn't occur in that order,
> maybe tweaking the source to say "tmp = 1+y; y *= y; inv *= tmp;"
> will induce it to, and shave off another cycle.
>
When we write assembly, scheduling is our job; when we write in C, it's
the compiler's job, isn't it?
Moreover, if I remember correctly, ARM processors have a
multiply-accumulate instruction that can compute A * (B + 1) without
needing to add 1 to B first (one computes A + B * A instead).
Then, I'd declare as many y variables as needed and let the optimizer
decide. Something like:
#define binvert_limb(inv,n) \
do { \
mp_limb_t __n = (n); \
mp_limb_t __inv = CNST_LIMB (3) * __n ^ CNST_LIMB (2); \
ASSERT ((__n & 1) == 1); \
\
mp_limb_t __y0 = CNST_LIMB (1) - __inv * __n; \
mp_limb_t __y1 = __y0 * __y0; \
mp_limb_t __y2 = __y1 * __y1; \
__inv = __inv * (1 + __y0) * (1 + __y1) * (1 + __y2); \
\
if (GMP_NUMB_BITS > 40) \
{ \
int __invbits = 40; \
      mp_limb_t __y = __y2;					\
do { \
__y *= __y; \
__inv *= __y + 1; \
__invbits *= 2; \
} while (__invbits < GMP_NUMB_BITS); \
} \
\
ASSERT ((__inv * __n & GMP_NUMB_MASK) == 1); \
(inv) = __inv & GMP_NUMB_MASK; \
} while (0)
Bye, mb