More div_qr_2 assembler
Torbjorn Granlund
tg at gmplib.org
Thu Mar 31 09:13:14 CEST 2011
nisse at lysator.liu.se (Niels Möller) writes:
$ ./speed -o cycles-broken -p 100000 -C -s3-10,50,200,500,1500 mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
clock_gettime is 1.000ns accurate
overhead 6.04 cycles, precision 100000 units of 1.00e-09 secs, CPU freq 1200.00 MHz
mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
3 #36.0929 41.9874 58.5886
4 #35.9422 39.5447 51.7249
5 #35.0578 38.5444 47.5463
6 #36.0060 37.4358 45.4954
7 #35.1112 36.6949 43.4555
8 #34.5703 36.2022 42.9322
9 #34.5919 36.1517 41.4659
10 #34.7529 37.0281 40.9467
50 34.5641 #33.4350 35.3366
200 34.3937 #33.0381 34.3683
500 34.2357 #32.8356 33.4413
1500 34.0592 #30.2301 30.9165
$ ./speed -o cycles-broken -p 100000 -c -s3 mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
clock_gettime is 1.000ns accurate
overhead 6.04 cycles, precision 100000 units of 1.00e-09 secs, CPU freq 1200.00 MHz
mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
3 #109.79 125.96 172.92
So for the unnormalized case, we get an additional constant overhead of
47 cycles (the main cost here is the more complicated handling of qh,
which may be almost a full limb), and then the normalization seems to
cost a cycle per quotient limb. Loopmixing both the normalized and
unnormalized loops might be worthwhile.
How can you time all three using the same speed command, given that the
2u variant don't accept the same divisors as the other two?
To fairly compare divrem_2--now used for all divisors but with
preshifting of the operands--to the new shift-on-the-fly code, I think
we should include the time for shifting. Perhaps add a new target to
tune/speed doing exactly that?
The slowdown for normalised operands is more worrying. Except that I
primarily worry for GMP high-end processors in this case, i.e., AMD
K8-K10 and to slightly lesser Intel Sandy Bridge.
Current code uses shld (and shrd at the end). This means that the
normalized function could set the shift count to zero and jump to the
unnormalized loop after having determined qh, reducing code size a bit.
But since SHLD is slow on some processors, we'd need code without shld
(depending on the SHLD_SLOW configure variable), and then it gets harder
to handle a zero shift count.
SHLD_SLOW means "SHLD_SUPERSLOW"; some machines, e.g., AMD K8-K10, have
slow but not superslow SHLD. (Of the 64-bit processors, Intel's atom
and VIA's nano have superslow SHLD.)
--
Torbjörn
More information about the gmp-devel
mailing list