div_qr_1 interface
Torbjorn Granlund
tg at gmplib.org
Mon Oct 21 13:02:32 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:

  Will try that. I think one could also try to delay the quotient store
  one iteration, keeping "Q1" in a register until the next iteration. Then
  one gets rid of the

      adc Q2,8(QP, UN, 8)

  in the loop, using only a single store per iteration in the likely case.
  May need yet another register, though.

On Intel chips, op-to-mem is expensive. Even op-from-memory is often
slower than load+op. (I understand the register shortage problem.)

  > I suspect one or two of the register-to-register copy insns could be
  > optimised out.

  Maybe. And it would be easier to avoid moves if one unrolls the loop
  twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
  bloated, of course.

It might be worth it, since this is an important operation.

  > In order to run this through the loopmixer, you need to set up data in
  > the prologue which makes the adjustment branch never be taken.
  > Letting the inverse be 0 or else B-1 might work...

  I vaguely recall some previous attempt at loopmixing this, but I don't
  remember any success.
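
For readers who do not have the algorithm in their head, here is a rough C
sketch of division by a normalized single limb using a precomputed inverse,
including a rarely taken adjustment step of the kind discussed above. It is
the textbook 2/1 step, not the loop in the assembly code under discussion.
The names invert_limb_sketch, div_2by1_preinv and div_qr_1n_sketch are made
up for this illustration, and it assumes 64-bit limbs and a compiler with
unsigned __int128 (GCC or Clang).

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension (assumption) */

/* Inverse of a normalized divisor d (top bit set), with B = 2^64:
   floor((B^2 - 1) / d) - B.  The quotient lies in [B, 2B-1], so
   truncating it to 64 bits subtracts B. */
static uint64_t invert_limb_sketch(uint64_t d)
{
    return (uint64_t)(~(u128)0 / d);
}

/* Divide the two-limb value (u1, u0) by d, where u1 < d and d is
   normalized.  Returns the quotient limb and stores the remainder in *r. */
static uint64_t div_2by1_preinv(uint64_t *r, uint64_t u1, uint64_t u0,
                                uint64_t d, uint64_t di)
{
    /* Candidate quotient: u1*di + (u1 + 1)*B + u0, taken mod B^2. */
    u128 q = (u128)u1 * di + (((u128)(u1 + 1) << 64) | u0);
    uint64_t q1 = (uint64_t)(q >> 64);
    uint64_t q0 = (uint64_t)q;

    uint64_t rem = u0 - q1 * d;      /* candidate remainder, mod B */
    if (rem > q0) {                  /* common correction */
        q1--;
        rem += d;
    }
    if (rem >= d) {                  /* rarely taken adjustment */
        q1++;
        rem -= d;
    }
    *r = rem;
    return q1;
}

/* Divide the n-limb number up[0..n-1] (least significant limb first)
   by d, writing n quotient limbs to qp and returning the remainder. */
static uint64_t div_qr_1n_sketch(uint64_t *qp, const uint64_t *up,
                                 size_t n, uint64_t d)
{
    uint64_t di = invert_limb_sketch(d);
    uint64_t r = 0;
    for (size_t i = n; i-- > 0; )
        qp[i] = div_2by1_preinv(&r, r, up[i], d, di);
    return r;
}
```

Each iteration depends on the remainder from the previous one, which is why
the latency of this recurrence tends to decide the cycles per limb.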
Let's take a look at current performance on all amd64 CPUs except nocona
(=pentium4). I compare the pi variants here.
Conclusions:
* The code is no win for AMD k10/k8 (although close to 10 c/l might well be
possible)
* The code is a big win for AMD bulldozer and also for piledriver
* The code is a big win for Intel core2 (alias conroe)
* The code is a cycle slower for Intel sandybridge
* The code is a cycle faster on Intel nehalem, ivybridge, haswell
* The code is a big win for VIA nano
In ~tege/GMP/newdiv/div_1n_pi2-x86_64.asm I claim 9.75 c/l (and that 7
c/l is possible) for k10/k8, 16 c/l for core2, and 13.25 c/l for
nehalem. Of course, the precomputation cost there is much higher.
******** k10 ********
overhead 6.00 cycles, precision 1000000 units of 3.12e-10 secs, CPU freq 3200.35 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #12.0018 26.0043
2 19.5030 #19.5024
3 17.6695 #17.3362
5 15.8019 #15.6019
8 14.7518 #14.6267
13 #14.0895 15.0901
22 #14.1463 14.2366
37 #13.6849 13.7393
62 #13.4139 13.4445
105 #13.2498 13.2589
178 #13.1524 13.1632
302 #13.0952 13.1011
513 #13.0607 13.0642
872 #13.0302 13.0325
******** bulldozer ********
overhead 6.00 cycles, precision 1000000 units of 2.77e-10 secs, CPU freq 3612.09 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #13.9118 30.8628
2 #22.5047 24.0040
3 20.6703 #20.3110
5 #18.0036 20.0033
8 #17.2535 19.2532
13 #16.7725 19.8804
22 #17.0943 20.5489
37 #16.6519 20.4899
62 #16.3905 20.2277
105 #16.2322 20.1748
178 #16.1383 20.0710
302 #16.0895 20.0499
513 #16.0513 20.0218
872 #16.0337 20.0186
******** piledriver ********
overhead 6.00 cycles, precision 1000000 units of 7.14e-10 secs, CPU freq 4000.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #13.4460 27.7072
2 #21.2536 22.0034
3 #19.1284 20.6698
5 #17.6283 19.6027
8 #16.8365 19.1819
13 #16.7634 19.3874
22 #16.5480 19.1393
37 #16.5433 18.9761
62 #16.3419 18.8095
105 #16.2121 18.6698
178 #16.0991 18.6101
302 #16.0503 18.5661
513 #16.2965 18.5379
872 #16.3580 19.0568
******** core2 ********
overhead 6.01 cycles, precision 1000000 units of 4.69e-10 secs, CPU freq 2132.93 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #15.7048 28.7024
2 #26.5272 26.5408
3 #24.7981 25.2783
5 #21.9089 24.9270
8 #20.9994 24.4683
13 #20.4778 24.1549
22 #20.0956 23.8461
37 #19.7079 23.8088
62 #19.6855 23.8366
105 #19.5935 23.9688
178 #19.3434 23.8856
302 #19.3213 23.8421
513 #19.4093 23.8145
872 #19.3424 23.8016
******** nehalem ********
overhead 6.00 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq 2670.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #12.1014 24.6814
2 #21.0024 21.8684
3 20.9170 #20.5440
5 #19.6475 20.3452
8 #19.1692 20.0459
13 #19.1643 19.7841
22 #18.9714 19.4242
37 #18.9281 19.6363
62 #18.7318 19.3491
105 #18.9929 19.2355
178 #18.7822 19.2779
302 #18.7368 19.1683
513 #18.7251 19.1364
872 #18.6993 19.1451
******** sandybridge ********
overhead 6.00 cycles, precision 1000000 units of 3.02e-10 secs, CPU freq 3311.22 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #11.0014 19.0029
2 14.5490 #14.0978
3 15.2221 #13.5232
5 15.0096 #13.5554
8 14.9704 #13.7121
13 15.0515 #13.8339
22 15.1180 #13.9051
37 15.6663 #14.3060
62 15.1635 #14.3427
105 15.2652 #14.3665
178 15.3321 #14.3720
302 15.2939 #14.3763
513 15.2368 #14.3822
872 15.1984 #14.3878
******** ivybridge ********
overhead 6.56 cycles, precision 1000000 units of 2.86e-10 secs, CPU freq 3500.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #11.0014 18.0029
2 #12.9980 13.2798
3 13.4954 #12.3369
5 13.2389 #12.7242
8 #13.2280 13.2300
13 #13.2331 13.6025
22 #13.1963 13.8519
37 #13.5290 13.9966
62 #13.1636 14.0779
105 #13.1274 14.2256
178 #13.1143 14.2060
302 #13.0608 14.2141
513 #13.1540 14.2050
872 #13.1764 14.2006
******** haswell ********
overhead 5.00 cycles, precision 1000000 units of 3.46e-10 secs, CPU freq 2893.21 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #9.0015 17.0025
2 12.0021 #11.7526
3 11.5992 #11.0628
5 11.7255 #11.5849
8 #11.7214 12.3431
13 #11.7310 12.8741
22 #11.7497 13.2291
37 #12.0599 13.7739
62 #12.0945 13.7338
105 #12.0774 13.7221
178 #12.0205 13.7119
302 #12.0197 13.7050
513 #12.0365 13.7028
872 #12.0426 13.6965
******** vianano ********
overhead 9.01 cycles, precision 1000000 units of 6.25e-10 secs, CPU freq 1600.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1 #22.5221 41.0437
2 #31.5313 32.5324
3 #28.0280 29.6950
5 #24.4255 27.4286
8 #22.3986 26.1515
13 #21.1205 26.6420
22 #21.0681 25.5705
37 #20.2399 24.9434
62 #19.7464 24.5723
105 #19.4484 24.3500
178 #19.2731 24.2168
302 #19.1679 24.1375
513 #19.1063 24.0905
872 #19.0714 24.0626
shell$
--
Torbjörn