udiv_qr_3by2 vs divappr
Torbjörn Granlund
tg at gmplib.org
Tue Aug 28 00:13:37 UTC 2018
tg at gmplib.org (Torbjörn Granlund) writes:
Last time I explored this, the division instructions (div/idiv) ran at
varying speed as long as the upper dividend word is zero. When it is,
the timing is rougly linear in either log2(dividend) or log2(quotient).
But the fastst time is still some 10 cycles, IIRC. And as Nisse says,
it is not pipelined.
The relative throughput between mul and div can still be on the order of
10x in the best case and 100x in the worst case.
I made some renewed measuments. The left column is ideal, small
dividend and small quotient, the right column is for a two word dividend
and large quotient. The numbers are approximate.
intel skylake 25 86
intel broadwell 25 82
intel haswell 27 79
intel sandybridge 28 83
intel nehalem 20 72
amd zen+ 15 44
amd excavator 15 75
amd piledriver 14 74
amd k10 20 79
intel silvermont 28 97
intel goldmont 13 42
The first 4 intel CPUs can do 1 mul/cycle, while nehalem, zen, k10 can
sustain one do 1/2 mul/cycle. The other CPUs do from 1/4 to just 1/8.
I think it is fair to say that div is still way slower than mul. Only
recent amd CPUs and intel goldmont have really fast small dividend
division.
GMP should still stay away from the div instruction, perhaps with the
exception for amd zen and intel goldmont and then only iff the divisor
in not invariant.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list