Basecase NxM division is like long division done by hand, but in base 2^mp_bits_per_limb. See Knuth section 4.3.1 algorithm D, and mpn/generic/sb_divrem_mn.c.
Briefly stated, while the dividend remains larger than the divisor, a high quotient limb q is formed and the Mx1 product q*d is subtracted at the top end of the dividend. With a normalized divisor (most significant bit set), each quotient limb can be formed with a 2x1 division and a 1x1 multiplication plus some subtractions. The 2x1 division is by the high limb of the divisor and is done either with a hardware divide or a multiply by inverse (the same as in Single Limb Division), whichever is faster. Such a quotient is sometimes one too big, requiring an addback of the divisor, but that happens rarely.
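The steps above can be sketched in C. This is a minimal illustration in the style of Knuth's algorithm D, not the code in mpn/generic/sb_divrem_mn.c. It assumes 32-bit limbs so that the 2x1 division fits in a uint64_t, and it uses a plain hardware divide rather than a multiply by inverse. The names limb_t, cmp_n and basecase_div are hypothetical, as is the convention of returning the extra high quotient limb separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumed 32-bit limb, for illustration only */

/* Hypothetical helper: compare two n-limb numbers, high limb first. */
static int cmp_n(const limb_t *a, const limb_t *b, size_t n)
{
    while (n-- > 0)
        if (a[n] != b[n])
            return a[n] > b[n] ? 1 : -1;
    return 0;
}

/* Divide {np, nn} by {dp, dn}; limbs are stored least significant first.
   Requires nn >= dn >= 2 and a normalized divisor (top bit of dp[dn-1] set).
   Stores the nn-dn low quotient limbs at qp, returns the extra high
   quotient limb (0 or 1), and leaves the remainder in np[0..dn-1]. */
static limb_t basecase_div(limb_t *qp, limb_t *np, size_t nn,
                           const limb_t *dp, size_t dn)
{
    const uint64_t B = (uint64_t)1 << 32;   /* the limb base */
    limb_t qhigh = 0;
    size_t i, j;

    assert(dn >= 2 && nn >= dn && (dp[dn - 1] >> 31) == 1);

    /* If the top dn limbs are already >= d, subtract d once up front. */
    if (cmp_n(np + nn - dn, dp, dn) >= 0) {
        uint64_t bw = 0;
        for (i = 0; i < dn; i++) {
            uint64_t t = (uint64_t)np[nn - dn + i] - dp[i] - bw;
            np[nn - dn + i] = (limb_t)t;
            bw = (t >> 32) & 1;             /* 1 if the subtraction wrapped */
        }
        qhigh = 1;
    }

    for (j = nn - dn; j-- > 0; ) {
        /* 2x1 division: top two dividend limbs by the high divisor limb. */
        uint64_t top2 = ((uint64_t)np[j + dn] << 32) | np[j + dn - 1];
        uint64_t qhat = top2 / dp[dn - 1];
        uint64_t rhat = top2 % dp[dn - 1];

        /* 1x1 multiplication by the second divisor limb refines qhat. */
        while (qhat >= B ||
               qhat * dp[dn - 2] > ((rhat << 32) | np[j + dn - 2])) {
            qhat--;
            rhat += dp[dn - 1];
            if (rhat >= B)
                break;
        }

        /* Subtract qhat*d from the (dn+1)-limb window starting at np[j]. */
        uint64_t borrow = 0;
        for (i = 0; i < dn; i++) {
            uint64_t p = qhat * dp[i] + borrow;
            limb_t lo = (limb_t)p;
            borrow = (p >> 32) + (np[j + i] < lo);
            np[j + i] -= lo;
        }
        limb_t top = np[j + dn];
        np[j + dn] = top - (limb_t)borrow;

        /* Rare case: qhat was one too big, so add the divisor back. */
        if (top < borrow) {
            uint64_t carry = 0;
            qhat--;
            for (i = 0; i < dn; i++) {
                uint64_t s = (uint64_t)np[j + i] + dp[i] + carry;
                np[j + i] = (limb_t)s;
                carry = s >> 32;
            }
            np[j + dn] += (limb_t)carry;
        }
        qp[j] = (limb_t)qhat;
    }
    return qhigh;
}
```

The dividend is overwritten in place, so after the call its low dn limbs hold the remainder. A divisor that is not normalized would first need both operands shifted left until the divisor's top bit is set, with the remainder shifted back afterwards; that step is left out of the sketch.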
With Q=N-M being the number of quotient limbs, this is an O(Q*M) algorithm and will run at a speed similar to a basecase QxM multiplication, differing in fact only in the extra multiply and divide for each of the Q quotient limbs.