div_qr_1 interface

Torbjorn Granlund tg at gmplib.org
Mon Oct 21 13:02:32 CEST 2013


nisse at lysator.liu.se (Niels Möller) writes:

  Will try that. I think one could also try to delay the quotient store
  one iteration, keeping "Q1" in a register until the next iteration. Then
  one gets rid of the
  
  	adc	Q2,8(QP, UN, 8)
  
  in the loop, using only a single store per iteration in the likely case.
  May need yet another register, though.
  
On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
slower than load+op.  (I understand the register shortage problem.)
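
The delayed-store idea quoted above can be sketched in C. This is only an illustration of the data flow, not the actual assembly loop: the `step_t` type and the function names are hypothetical, and each step is assumed to yield one quotient limb plus a 0/1 carry into the limb produced just before it (the role of Q2 in the quoted `adc`). A C compiler makes its own register choices, so the point is the single store per iteration in the likely case, not the exact instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-iteration result: limb for position j, plus a 0/1
   carry that belongs to position j+1 (the limb produced one step earlier). */
typedef struct { uint64_t limb; uint64_t carry; } step_t;

/* Add 1 at qp[i] and ripple upwards; returns the carry out of qp[n-1]. */
uint64_t incr(uint64_t *qp, size_t i, size_t n)
{
  for (; i < n; i++)
    if (++qp[i] != 0)
      return 0;
  return 1;
}

/* Variant A: store the new limb, then fix up the previous limb in memory
   (a read-modify-write, like an adc with a memory destination). */
uint64_t store_rmw(uint64_t *qp, const step_t *s, size_t n)
{
  uint64_t top = 0;
  for (size_t j = n; j-- > 0;) {
    qp[j] = s[j].limb;
    if (s[j].carry)
      top += incr(qp, j + 1, n);
  }
  return top;
}

/* Variant B: keep the previous limb in a local until the next step has
   delivered its carry, then store it once. Memory above is touched only
   in the rare case where the pending limb wraps around. */
uint64_t store_delayed(uint64_t *qp, const step_t *s, size_t n)
{
  if (n == 0)
    return 0;
  uint64_t top = s[n - 1].carry;      /* carry out of the top position */
  uint64_t pending = s[n - 1].limb;   /* limb not yet stored */
  for (size_t j = n - 1; j-- > 0;) {
    pending += s[j].carry;
    if (pending < s[j].carry)         /* wrapped: unlikely path */
      top += incr(qp, j + 2, n);
    qp[j + 1] = pending;              /* the single store per iteration */
    pending = s[j].limb;
  }
  qp[0] = pending;
  return top;
}
```

Both variants produce the same limbs. Variant B needs one extra live value (`pending`), which is the "yet another register" cost mentioned in the quoted text.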

  > I suspect one or two of the register-to-register copy insns could be
  > optimised out.
  
  Maybe. And it would be easier to avoid moves if one unrolls the loop
  twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
  bloated, of course.
  
It might be worth it, since this is an important operation.

  > In order to run this through the loopmixer, you need to setup data in
  > the prologue which makes the adjustment branch to never be taken.
  > Letting the inverse be 0 or else B-1 might work...
  
  I vaguely recall some previous attempt at loopmixing this, but I don't
  remember any success.

Let's take a look at current performance on all amd64 CPUs except nocona
(=pentium4).  I compare the pi variants here.
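
For context on what a "pi" (precomputed inverse) division does per limb, here is a minimal C sketch of the textbook 2/1 division step with a precomputed reciprocal, applied limb by limb. It is not the new div_qr_1n_pi1 loop discussed in this thread, which arranges the arithmetic differently. The function names are made up for illustration, and `unsigned __int128` (a GCC/Clang extension) is assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Reciprocal of a normalized divisor d (top bit set):
   floor((B^2 - 1) / d) - B, with B = 2^64. */
uint64_t invert_limb_sketch(uint64_t d)
{
  assert(d >> 63);
  /* The 128-bit quotient lies in [B, 2B), so truncation subtracts B. */
  return (uint64_t)(~(u128)0 / d);
}

/* Divide the two-limb value (nh, nl) by d, requiring nh < d.
   Returns the quotient limb and stores the remainder in *r. */
uint64_t div_2by1_preinv(uint64_t *r, uint64_t nh, uint64_t nl,
                         uint64_t d, uint64_t di)
{
  /* Candidate quotient: nh * di + (nh + 1, nl), computed mod B^2. */
  u128 q = (u128)nh * di + (((u128)(nh + 1) << 64) | nl);
  uint64_t qh = (uint64_t)(q >> 64);
  uint64_t ql = (uint64_t)q;
  uint64_t rem = nl - qh * d;               /* arithmetic mod B */
  if (rem > ql)  { qh--; rem += d; }        /* candidate was one too large */
  if (rem >= d)  { qh++; rem -= d; }        /* unlikely adjustment */
  *r = rem;
  return qh;
}

/* Divide {up, n} (least significant limb first) by normalized d.
   Writes n quotient limbs to qp and returns the remainder. */
uint64_t div_1n_preinv_sketch(uint64_t *qp, const uint64_t *up, size_t n,
                              uint64_t d, uint64_t di)
{
  uint64_t r = 0;
  for (size_t i = n; i-- > 0;)
    qp[i] = div_2by1_preinv(&r, r, up[i], d, di);
  return r;
}
```

Each quotient limb depends on the remainder from the previous step, so the loop is latency bound. That dependency chain is what the cycles-per-limb figures below measure.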

Conclusions:

* The code is no win for AMD k10/k8 (although close to 10 c/l might well be
  possible)

* The code is a big win for AMD bulldozer and also for piledriver
  
* The code is a big win for Intel core2 (alias conroe)

* The code is a cycle slower on Intel sandybridge

* The code is a cycle faster on Intel nehalem, ivybridge, haswell

* The code is a big win for VIA nano

In ~tege/GMP/newdiv/div_1n_pi2-x86_64.asm I claim 9.75 c/l (and that 7
c/l is possible) for k10/k8, 16 c/l for core2, and 13.25 c/l for
nehalem.  Of course, the precomputation cost there is much higher.

******** k10 ********
overhead 6.00 cycles, precision 1000000 units of 3.12e-10 secs, CPU freq 3200.35 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #12.0018       26.0043
2             19.5030      #19.5024
3             17.6695      #17.3362
5             15.8019      #15.6019
8             14.7518      #14.6267
13           #14.0895       15.0901
22           #14.1463       14.2366
37           #13.6849       13.7393
62           #13.4139       13.4445
105          #13.2498       13.2589
178          #13.1524       13.1632
302          #13.0952       13.1011
513          #13.0607       13.0642
872          #13.0302       13.0325
******** bulldozer ********
overhead 6.00 cycles, precision 1000000 units of 2.77e-10 secs, CPU freq 3612.09 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #13.9118       30.8628
2            #22.5047       24.0040
3             20.6703      #20.3110
5            #18.0036       20.0033
8            #17.2535       19.2532
13           #16.7725       19.8804
22           #17.0943       20.5489
37           #16.6519       20.4899
62           #16.3905       20.2277
105          #16.2322       20.1748
178          #16.1383       20.0710
302          #16.0895       20.0499
513          #16.0513       20.0218
872          #16.0337       20.0186
******** piledriver ********
overhead 6.00 cycles, precision 1000000 units of 7.14e-10 secs, CPU freq 4000.00 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #13.4460       27.7072
2            #21.2536       22.0034
3            #19.1284       20.6698
5            #17.6283       19.6027
8            #16.8365       19.1819
13           #16.7634       19.3874
22           #16.5480       19.1393
37           #16.5433       18.9761
62           #16.3419       18.8095
105          #16.2121       18.6698
178          #16.0991       18.6101
302          #16.0503       18.5661
513          #16.2965       18.5379
872          #16.3580       19.0568

******** core2 ********
overhead 6.01 cycles, precision 1000000 units of 4.69e-10 secs, CPU freq 2132.93 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #15.7048       28.7024
2            #26.5272       26.5408
3            #24.7981       25.2783
5            #21.9089       24.9270
8            #20.9994       24.4683
13           #20.4778       24.1549
22           #20.0956       23.8461
37           #19.7079       23.8088
62           #19.6855       23.8366
105          #19.5935       23.9688
178          #19.3434       23.8856
302          #19.3213       23.8421
513          #19.4093       23.8145
872          #19.3424       23.8016
******** nehalem ********
overhead 6.00 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq 2670.00 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #12.1014       24.6814
2            #21.0024       21.8684
3             20.9170      #20.5440
5            #19.6475       20.3452
8            #19.1692       20.0459
13           #19.1643       19.7841
22           #18.9714       19.4242
37           #18.9281       19.6363
62           #18.7318       19.3491
105          #18.9929       19.2355
178          #18.7822       19.2779
302          #18.7368       19.1683
513          #18.7251       19.1364
872          #18.6993       19.1451
******** sandybridge ********
overhead 6.00 cycles, precision 1000000 units of 3.02e-10 secs, CPU freq 3311.22 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #11.0014       19.0029
2             14.5490      #14.0978
3             15.2221      #13.5232
5             15.0096      #13.5554
8             14.9704      #13.7121
13            15.0515      #13.8339
22            15.1180      #13.9051
37            15.6663      #14.3060
62            15.1635      #14.3427
105           15.2652      #14.3665
178           15.3321      #14.3720
302           15.2939      #14.3763
513           15.2368      #14.3822
872           15.1984      #14.3878
******** ivybridge ********
overhead 6.56 cycles, precision 1000000 units of 2.86e-10 secs, CPU freq 3500.00 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #11.0014       18.0029
2            #12.9980       13.2798
3             13.4954      #12.3369
5             13.2389      #12.7242
8            #13.2280       13.2300
13           #13.2331       13.6025
22           #13.1963       13.8519
37           #13.5290       13.9966
62           #13.1636       14.0779
105          #13.1274       14.2256
178          #13.1143       14.2060
302          #13.0608       14.2141
513          #13.1540       14.2050
872          #13.1764       14.2006
******** haswell ********
overhead 5.00 cycles, precision 1000000 units of 3.46e-10 secs, CPU freq 2893.21 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1             #9.0015       17.0025
2             12.0021      #11.7526
3             11.5992      #11.0628
5             11.7255      #11.5849
8            #11.7214       12.3431
13           #11.7310       12.8741
22           #11.7497       13.2291
37           #12.0599       13.7739
62           #12.0945       13.7338
105          #12.0774       13.7221
178          #12.0205       13.7119
302          #12.0197       13.7050
513          #12.0365       13.7028
872          #12.0426       13.6965

******** vianano ********
overhead 9.01 cycles, precision 1000000 units of 6.25e-10 secs, CPU freq 1600.00 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1            #22.5221       41.0437
2            #31.5313       32.5324
3            #28.0280       29.6950
5            #24.4255       27.4286
8            #22.3986       26.1515
13           #21.1205       26.6420
22           #21.0681       25.5705
37           #20.2399       24.9434
62           #19.7464       24.5723
105          #19.4484       24.3500
178          #19.2731       24.2168
302          #19.1679       24.1375
513          #19.1063       24.0905
872          #19.0714       24.0626

-- 
Torbjörn

