div_qr_1n_pi1
Niels Möller
nisse at lysator.liu.se
Sun Jun 6 18:18:36 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
> You're idea of conditonally adding the invariant d * B2 at the right
> place is also interesting,
I've tried it out. Works nicely, but no speedup on my machine. I'm
attaching another patch. There are then 4 methods:
method 1: Old loop around udiv_qrnnd_preinv.
method 2: The clever code from 10 years ago, with the microoptimization
I commited the other day.
method 3: More or less the same as I posted a few days ago.
method 4: Postpones the update u1 -= u2 d, off the critical recurrency
chain. Instead, conditionally adds in the constant B2 (B - d) to the
lower u limbs.
Benchmark on my laptop:
$ ./speed -p 1000000 -s 2-20 -C mpn_div_qr_1n_pi1.0x8765432108765432 mpn_div_qr_1n_pi1_1.0x8765432108765432 mpn_div_qr_1n_pi1_2.0x8765432108765432 mpn_div_qr_1n_pi1_3.0x8765432108765432 mpn_div_qr_1n_pi1_4.0x8765432108765432
overhead 2.63 cycles, precision 1000000 units of 7.16e-10 secs, CPU freq 1396.05 MHz
mpn_div_qr_1n_pi1.0x8765432108765432 mpn_div_qr_1n_pi1_1.0x8765432108765432 mpn_div_qr_1n_pi1_2.0x8765432108765432 mpn_div_qr_1n_pi1_3.0x8765432108765432 mpn_div_qr_1n_pi1_4.0x8765432108765432
2 4.4566 #3.9635 5.8490 5.4085 6.0087
3 4.4323 #4.2441 5.8115 5.1158 5.7832
4 4.5348 #4.3992 5.7306 5.0807 5.8611
5 #4.5534 4.6698 5.7493 4.9803 5.5605
6 #4.5653 4.8412 5.7497 4.9129 5.6516
7 #4.6069 5.1388 5.7388 4.8811 5.6110
8 #4.6202 5.5073 5.7423 4.9359 5.5695
9 #4.6433 5.7341 5.7537 4.9357 5.5407
10 #4.6436 5.9595 5.7400 4.9231 5.5428
11 #4.6698 6.1348 5.7449 4.9430 5.5237
12 #4.6395 6.2541 5.7378 4.9452 5.5239
13 #4.6905 6.3761 5.7373 4.9482 5.5085
14 #4.6700 6.4692 5.8173 4.9447 5.5006
15 #4.6643 6.5548 5.7426 4.9644 5.4958
16 #4.6809 6.6305 5.7439 4.9625 5.4924
17 #4.6800 6.6901 5.7418 4.9576 5.4760
18 #4.6903 6.7440 5.7436 4.9866 5.4840
19 #4.6886 6.7891 5.7366 4.9753 5.4872
20 #4.6818 6.8370 5.7405 4.9783 5.4820
asm method 1 method 2 method 3 method 4
So on my laptop, method 3 wins, just 0.3 cycles/limb behind the assembly
implementation.
It's also curius that method 1 wins for the smaller operands, maybe
that's the cost of the B2 = -d * dinv ? I think assembly implementation
of method 3 may get register challenged (but I haven't loked at what the
compiler generates). To reduce number of live registers, one might need
to complete the u accumulation before the q multiply.
Regards,
/Niels
-------------- next part --------------
A non-text attachment was scrubbed...
Name: div_qr_1n_pi1.patch
Type: text/x-diff
Size: 12492 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20210606/821d8bc7/attachment.bin>
-------------- next part --------------
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list