hgcd1/2

Fri Oct 4 07:59:00 UTC 2019

Ciao

Il Gio, 5 Settembre 2019 10:33 am, Torbjörn Granlund ha scritto:
> One of my previous proposal looks like this, when put into its right form:
>
> mp_limb_t
> div1 (mp_limb_t n0, mp_limb_t d0)
> {
>   if (UNLIKELY ((d0 >> 61) != 0))
>     return n0 / d0;
>
>   if (UNLIKELY (n0 >= (d0 << 3)))
>     return n0 / d0;
>   else {
>     mp_limb_t q, mask;
>
>     d0 <<= 2;
>
>     mask = -(mp_limb_t) (n0 >= d0);
>     q = 4 & mask;
>     n0 -= d0 & mask;
>
>     d0 >>= 1;
>     mask = -(mp_limb_t) (n0 >= d0);
>     q += 2 & mask;
>     n0 -= d0 & mask;
>
>     d0 >>= 1;
>     mask = -(mp_limb_t) (n0 >= d0);
>     q += 1 & mask;
>     n0 -= d0 & mask;
>
>     return q;
>   }
> }

> Some timing tests indicate that it is 6 times faster on 'shell' for
> quotients < 8.  For hgcd, that would mean almost all quotients.

As I realised on the powm thread...
On shell, also the following rewrite is compiled without branches:

static inline mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
  mp_double_limb_t res;
  if (UNLIKELY ((d0 >> (GMP_LIMB_BITS - 3)) != 0)
      || UNLIKELY (n0 >= (d0 << 3)))
    {
      res.d1 = n0 / d0;
      res.d0 = n0 - res.d1 * d0;
    }
  else
    {
      mp_limb_t q;

      d0 <<= 2;

      q = (n0 >= d0);
      n0 -= (n0 >= d0 ? d0 : 0);

      d0 >>= 1;
      q <<= 1;
      q += (n0 >= d0);
      n0 -= (n0 >= d0 ? d0 : 0);

      d0 >>= 1;
      q <<= 1;
      q += (n0 >= d0);
      n0 -= (n0 >= d0 ? d0 : 0);

      res.d0 = n0;
      res.d1 = q;
    }
  return res;
}

The last
      d0 >>= 1;
      q <<= 1;
      q += (n0 >= d0);
      n0 -= (n0 >= d0 ? d0 : 0);
block gets compiled (on shell, with -O2) into:
        shrq    %rsi
        xorl    %edx, %edx
        cmpq    %rsi, %rdi
        setae   %dl
        leaq    (%rdx,%rax,2), %rax
        cmovbq  %rcx, %rsi
        subq    %rsi, %rdi

Not bad, is it?

Must we use masks to obtain branchless code? Or should we let the compiler
choose?

Ĝis,
m

-- 
http://bodrato.it/papers/