div_qr_1n_pi1

Sun Jun 6 18:18:36 UTC 2021

nisse at lysator.liu.se (Niels Möller) writes:

> You're idea of conditonally adding the invariant d * B2 at the right
> place is also interesting,

I've tried it out. Works nicely, but no speedup on my machine. I'm
attaching another patch. There are then 4 methods:

method 1: Old loop around udiv_qrnnd_preinv.

method 2: The clever code from 10 years ago, with the microoptimization
  I commited the other day.

method 3: More or less the same as I posted a few days ago.

method 4: Postpones the update u1 -= u2 d, off the critical recurrency
  chain. Instead, conditionally adds in the constant B2 (B - d) to the
  lower u limbs.

Benchmark on my laptop:

$ ./speed -p 1000000 -s 2-20 -C mpn_div_qr_1n_pi1.0x8765432108765432 mpn_div_qr_1n_pi1_1.0x8765432108765432 mpn_div_qr_1n_pi1_2.0x8765432108765432 mpn_div_qr_1n_pi1_3.0x8765432108765432 mpn_div_qr_1n_pi1_4.0x8765432108765432
overhead 2.63 cycles, precision 1000000 units of 7.16e-10 secs, CPU freq 1396.05 MHz
        mpn_div_qr_1n_pi1.0x8765432108765432 mpn_div_qr_1n_pi1_1.0x8765432108765432 mpn_div_qr_1n_pi1_2.0x8765432108765432 mpn_div_qr_1n_pi1_3.0x8765432108765432 mpn_div_qr_1n_pi1_4.0x8765432108765432
2              4.4566       #3.9635        5.8490        5.4085        6.0087
3              4.4323       #4.2441        5.8115        5.1158        5.7832
4              4.5348       #4.3992        5.7306        5.0807        5.8611
5             #4.5534        4.6698        5.7493        4.9803        5.5605
6             #4.5653        4.8412        5.7497        4.9129        5.6516
7             #4.6069        5.1388        5.7388        4.8811        5.6110
8             #4.6202        5.5073        5.7423        4.9359        5.5695
9             #4.6433        5.7341        5.7537        4.9357        5.5407
10            #4.6436        5.9595        5.7400        4.9231        5.5428
11            #4.6698        6.1348        5.7449        4.9430        5.5237
12            #4.6395        6.2541        5.7378        4.9452        5.5239
13            #4.6905        6.3761        5.7373        4.9482        5.5085
14            #4.6700        6.4692        5.8173        4.9447        5.5006
15            #4.6643        6.5548        5.7426        4.9644        5.4958
16            #4.6809        6.6305        5.7439        4.9625        5.4924
17            #4.6800        6.6901        5.7418        4.9576        5.4760
18            #4.6903        6.7440        5.7436        4.9866        5.4840
19            #4.6886        6.7891        5.7366        4.9753        5.4872
20            #4.6818        6.8370        5.7405        4.9783        5.4820
              asm          method 1      method 2      method 3      method 4

So on my laptop, method 3 wins, just 0.3 cycles/limb behind the assembly
implementation. 

It's also curius that method 1 wins for the smaller operands, maybe
that's the cost of the B2 = -d * dinv ? I think assembly implementation
of method 3 may get register challenged (but I haven't loked at what the
compiler generates). To reduce number of live registers, one might need
to complete the u accumulation before the q multiply.

Regards,
/Niels

-------------- next part --------------
A non-text attachment was scrubbed...
Name: div_qr_1n_pi1.patch
Type: text/x-diff
Size: 12492 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20210606/821d8bc7/attachment.bin>
-------------- next part --------------

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.