Some div_qr_2 assembler
Niels Möller
nisse at lysator.liu.se
Tue Mar 22 22:11:54 CET 2011
nisse at lysator.liu.se (Niels Möller) writes:
> Well, problem was that it wasn't even clear to me if divrem_2 used the
> current 3/2 method or some different strategy. With a working 3/2
> implementation which I understand to compare with, it will be easier to
> figure things out, and then I can copy the inner loop.
Now I've copied the divrem_2 innerloop, sans the fraction things (and
with named registers, so that I can read it...), and pushed the result.
Seems to run a cycle and a half faster per limb than divrem_2. A bit
worse overhead; there's one additional function call, and maybe
something else which is stupid and should be improved.
Case n == 2 is awful, this ought to be handled in the calling function,
mpn_div_qr_2, which can decide that it's no use to compute the inverse
in this case. Maybe also n == 3 should be handled specially, without
computing an inverse.
$ ./speed -C -s 2-15,50,200,500 mpn_div_qr_2_norm mpn_divrem_2
overhead 6.13 cycles, precision 10000 units of 8.33e-10 secs, CPU freq 1200.00 MHz
mpn_div_qr_2_norm mpn_divrem_2
2 51.6286 #6.7213
3 40.6707 #37.1871
4 38.5699 #36.1507
5 37.3071 #37.0036
6 37.0347 #35.6633
7 36.5000 #35.5216
8 36.3649 #35.1711
9 36.0236 #35.4510
10 35.6633 #35.2032
11 36.5253 #36.4613
12 35.1026 #34.8942
13 34.9776 #34.8942
14 #34.9006 35.4968
15 #34.7905 34.8730
50 #33.7767 34.9033
200 #33.3925 34.6600
500 #33.0980 34.4500
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list