udiv_qr_3by2 vs divappr

Tue Aug 28 00:13:37 UTC 2018

tg at gmplib.org (Torbjörn Granlund) writes:

  Last time I explored this, the division instructions (div/idiv) ran at
  varying speed as long as the upper dividend word is zero.  When it is,
  the timing is rougly linear in either log2(dividend) or log2(quotient).
  But the fastst time is still some 10 cycles, IIRC.  And as Nisse says,
  it is not pipelined.

  The relative throughput between mul and div can still be on the order of
  10x in the best case and 100x in the worst case.

I made some renewed measuments.  The left column is ideal, small
dividend and small quotient, the right column is for a two word dividend
and large quotient.  The numbers are approximate.

intel skylake      25 86
intel broadwell    25 82
intel haswell      27 79
intel sandybridge  28 83
intel nehalem      20 72
amd zen+           15 44
amd excavator      15 75
amd piledriver     14 74
amd k10            20 79
intel silvermont   28 97
intel goldmont     13 42

The first 4 intel CPUs can do 1 mul/cycle, while nehalem, zen, k10 can
sustain one do 1/2 mul/cycle.  The other CPUs do from 1/4 to just 1/8.

I think it is fair to say that div is still way slower than mul.  Only
recent amd CPUs and intel goldmont have really fast small dividend
division.

GMP should still stay away from the div instruction, perhaps with the
exception for amd zen and intel goldmont and then only iff the divisor
in not invariant.

-- 
Torbjörn
Please encrypt, key id 0xC8601622