div_qr_1 interface
Torbjorn Granlund
tg at gmplib.org
Mon Oct 21 00:54:32 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:
On my core2 laptop:
$ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x9999999999999999 mpn_div_qr_1.0x9999999999999999
overhead 6.13 cycles, precision 10000 units of 8.33e-10 secs, CPU freq 1200.00 MHz
     mpn_divrem_1.0x9999999999999999  mpn_div_qr_1.0x9999999999999999
2                            60.6420                         #39.9427
3                           #40.9839                          55.0469
4                           #43.7667                          44.4534
5                            44.6333                         #38.9055
6                            39.6259                         #34.4167
7                            34.0063                         #32.4018
8                            30.1364                         #28.5745
9                            29.6472                         #27.4599
10                           29.1270                         #26.7300
100                          24.7920                         #20.6700
500                          24.4400                         #19.7600
So here it's a clear win, except an ugly regression for n = 3.
You might want to pass -p1000000 or something, to avoid startup noise.
On shell, the same command gives:
2                           #37.4379                          51.1157
3                           #30.0256                          61.0904
4                           #25.8058                          27.0781
5                           #23.2717                          24.2831
6                           #21.7520                          22.4346
7                           #20.5219                          21.1111
8                           #19.4783                          20.1101
9                           #18.7726                          19.3369
10                          #18.3271                          18.7228
100                         #13.8063                          13.8175
500                         #13.2670                          13.2750
So here the new code is epsilon slower for the larger sizes. Maybe the
loopmixer can help.
The old code runs optimally, given its instructions.
What is the latency critical path of the new code?
The performance for n=3 is poor for both processors. Do you understand
the reason?
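
For the -p suggestion above, a rerun might look like the following sketch. It reuses the sizes, divisor and function names from the quoted command and only adds -p1000000 (the quoted output reports a precision of 10000 units). That it is run from the GMP tune directory with speed already built is an assumption, so the command is only executed if ./speed is present.

```shell
# Sketch: the quoted comparison, rerun with a higher precision setting.
# Assumption: current directory is GMP's tune/ directory with `speed` built.
divisor=0x9999999999999999
cmd="./speed -p1000000 -s 2-10,100,500 -C mpn_divrem_1.$divisor mpn_div_qr_1.$divisor"

# Show the command, then run it only if the binary is actually there.
echo "$cmd"
if [ -x ./speed ]; then
  $cmd
fi
```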
--
Torbjörn