hgcd1/2
Marco Bodrato
bodrato at mail.dm.unipi.it
Fri Oct 4 07:59:00 UTC 2019
Ciao
Il Gio, 5 Settembre 2019 10:33 am, Torbjörn Granlund ha scritto:
> One of my previous proposal looks like this, when put into its right form:
>
> mp_limb_t
> div1 (mp_limb_t n0, mp_limb_t d0)
> {
> if (UNLIKELY ((d0 >> 61) != 0))
> return n0 / d0;
>
> if (UNLIKELY (n0 >= (d0 << 3)))
> return n0 / d0;
> else {
> mp_limb_t q, mask;
>
> d0 <<= 2;
>
> mask = -(mp_limb_t) (n0 >= d0);
> q = 4 & mask;
> n0 -= d0 & mask;
>
> d0 >>= 1;
> mask = -(mp_limb_t) (n0 >= d0);
> q += 2 & mask;
> n0 -= d0 & mask;
>
> d0 >>= 1;
> mask = -(mp_limb_t) (n0 >= d0);
> q += 1 & mask;
> n0 -= d0 & mask;
>
> return q;
> }
> }
> Some timing tests indicate that it is 6 times faster on 'shell' for
> quotients < 8. For hgcd, that would mean almost all quotients.
As I realised on the powm thread...
On shell, also the following rewrite is compiled without branches:
static inline mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
mp_double_limb_t res;
if (UNLIKELY ((d0 >> (GMP_LIMB_BITS - 3)) != 0)
|| UNLIKELY (n0 >= (d0 << 3)))
{
res.d1 = n0 / d0;
res.d0 = n0 - res.d1 * d0;
}
else
{
mp_limb_t q;
d0 <<= 2;
q = (n0 >= d0);
n0 -= (n0 >= d0 ? d0 : 0);
d0 >>= 1;
q <<= 1;
q += (n0 >= d0);
n0 -= (n0 >= d0 ? d0 : 0);
d0 >>= 1;
q <<= 1;
q += (n0 >= d0);
n0 -= (n0 >= d0 ? d0 : 0);
res.d0 = n0;
res.d1 = q;
}
return res;
}
The last
d0 >>= 1;
q <<= 1;
q += (n0 >= d0);
n0 -= (n0 >= d0 ? d0 : 0);
block gets compiled (on shell, with -O2) into:
shrq %rsi
xorl %edx, %edx
cmpq %rsi, %rdi
setae %dl
leaq (%rdx,%rax,2), %rax
cmovbq %rcx, %rsi
subq %rsi, %rdi
Not bad, is it?
Must we use masks to obtain branchless code? Or should we let the compiler
choose?
Ĝis,
m
--
http://bodrato.it/papers/
More information about the gmp-devel
mailing list