div_qr_1 interface

Mon Oct 21 00:54:32 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  On my core2 laptop:

  $ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x9999999999999999 mpn_div_qr_1.0x9999999999999999
  overhead 6.13 cycles, precision 10000 units of 8.33e-10 secs, CPU freq 1200.00 MHz
          mpn_divrem_1.0x9999999999999999 mpn_div_qr_1.0x9999999999999999
  2             60.6420      #39.9427
  3            #40.9839       55.0469
  4            #43.7667       44.4534
  5             44.6333      #38.9055
  6             39.6259      #34.4167
  7             34.0063      #32.4018
  8             30.1364      #28.5745
  9             29.6472      #27.4599
  10            29.1270      #26.7300
  100           24.7920      #20.6700
  500           24.4400      #19.7600

  So here it's a clear win, except an ugly regression for n = 3.

You might want to pass a larger precision, e.g. -p1000000, to avoid startup noise.

  On shell, the same command gives:

  2            #37.4379       51.1157
  3            #30.0256       61.0904
  4            #25.8058       27.0781
  5            #23.2717       24.2831
  6            #21.7520       22.4346
  7            #20.5219       21.1111
  8            #19.4783       20.1101
  9            #18.7726       19.3369
  10           #18.3271       18.7228
  100          #13.8063       13.8175
  500          #13.2670       13.2750

  So here the new code is epsilon slower for the larger sizes. Maybe the
  loopmixer can help.

The old code runs optimally, given its instructions.
What is the latency critical path of the new code?
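
For reference on the critical-path question, here is a minimal sketch of plain schoolbook n-by-1 division. It is not GMP's code: the name div_qr_1_sketch and the 32-bit limb size are assumptions chosen so the example stays portable. The point is that each step's remainder is an input to the next step, so the loop forms a serial dependency chain, and the per-limb cost is bounded below by the latency of that chain rather than by instruction throughput.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration, not a GMP function.
   Divides the n-limb number np[0..n-1] (least significant limb first)
   by the single limb d (d != 0), stores the n quotient limbs in qp[],
   and returns the remainder.  Limbs are 32 bits here (an assumption)
   so that a 64-bit type can hold the two-limb partial dividend. */
static uint32_t
div_qr_1_sketch (uint32_t *qp, const uint32_t *np, size_t n, uint32_t d)
{
  uint32_t r = 0;

  /* Process limbs from most significant to least significant. */
  for (size_t i = n; i-- > 0; )
    {
      /* r < d holds on entry, so the quotient of (r, np[i]) by d
         fits in one limb. */
      uint64_t t = ((uint64_t) r << 32) | np[i];
      qp[i] = (uint32_t) (t / d);

      /* Loop-carried dependency: the next iteration cannot start its
         division until this remainder is available. */
      r = (uint32_t) (t % d);
    }
  return r;
}
```

Real implementations typically replace the hardware division with multiplication by a precomputed inverse of d, but the remainder still has to flow from one step to the next, which is what the critical-path question is about.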

The performance for n = 3 is poor on both processors.  Do you understand
the reason?

-- 
Torbjörn