# Division by 9 (and other factors of 2^60 - 1)

Niels Möller nisse at lysator.liu.se
Mon Oct 12 22:42:28 CEST 2009

```nisse at lysator.liu.se (Niels Möller) writes:

> Here's' some code for dividing using multiplication + Hensel-division
> by 2^60 - 1.

I've tried some timing now, running on shell.gmplib.org (currently a
core 2). The following variant,

mp_limb_t
bdiv_t60m1 (mp_ptr qp, mp_srcptr ap, mp_size_t n, mp_limb_t bd)
{
mp_size_t i;
mp_limb_t h;
mp_limb_t cy;

for (i = 0, h = cy = 0; i < n; i++)
{
mp_limb_t a = ap[i];
mp_limb_t q;
mp_limb_t t1, t0;
mp_limb_t s;

umul_ppmm (t1, t0, a, bd);
add_ssaaaa(t1, t0, t1, t0, 0, h);

/* Hensel division by 2^60 - 1, inverse -(2^60 + 1) */
t0 = - t0;
s = (t0 << 60);
q = s + t0;
qp[i] = q;

SUBC_LIMB (cy, h, t1, (q >> 4) - (-t0 > s) + cy);
}

return h;
}

seem to run at ~13.6 cycles. To be compared to 14.1 cycles for general
divexact_1 and 3.9 for divexact_by3 (in assembler). One could try
handcoded assembly, my most optimistic estimate of the above is a
latency of 6 cycles for the recurrency chain (not counting any needed
moves). Maybe one can do something clever to get the negation off the
critical path, replacing the SUBC_LIMB by ADDC_LIMB, but I doubt that
will save more than one cycle.

So my guess is that this method can be competitive only for processors
where the multiplication unit has poor throughput.

Repeated divexact_by3 seems to be the most efficient way, so far, to
divide by 9 on x86_64. It might be fun to try to do two iterations of
bdiv_dbm1c as a single loop. That would be able to handle division by
9, 15 and 25.

One other thing to try is 2^63 + 1 = 3^3 * 274177 * 67280421310721.
Not as many useful factors as 2^60 - 1, but sufficient for dividing by
9 and 27. Signs are going to be different, but I'm not sure they will
be easier to handle.

Regards,
/Niels
```