Some div_qr_2 assembler

Tue Mar 22 22:11:54 CET 2011

nisse at lysator.liu.se (Niels Möller) writes:

> Well, problem was that it wasn't even clear to me if divrem_2 used the
> current 3/2 method or some different strategy. With a working 3/2
> implementation which I understand to compare with, it will be easier to
> figure things out, and then I can copy the inner loop.

Now I've copied the divrem_2 innerloop, sans the fraction things (and
with named registers, so that I can read it...), and pushed the result.

Seems to run a cycle and a half faster per limb than divrem_2. A bit
worse overhead; there's one additional function call, and maybe
something else which is stupid and should be improved.

Case n == 2 is awful, this ought to be handled in the calling function,
mpn_div_qr_2, which can decide that it's no use to compute the inverse
in this case. Maybe also n == 3 should be handled specially, without
computing an inverse.

$ ./speed -C -s 2-15,50,200,500 mpn_div_qr_2_norm mpn_divrem_2
overhead 6.13 cycles, precision 10000 units of 8.33e-10 secs, CPU freq 1200.00 MHz

        mpn_div_qr_2_norm  mpn_divrem_2
2             51.6286       #6.7213
3             40.6707      #37.1871
4             38.5699      #36.1507
5             37.3071      #37.0036
6             37.0347      #35.6633
7             36.5000      #35.5216
8             36.3649      #35.1711
9             36.0236      #35.4510
10            35.6633      #35.2032
11            36.5253      #36.4613
12            35.1026      #34.8942
13            34.9776      #34.8942
14           #34.9006       35.4968
15           #34.7905       34.8730
50           #33.7767       34.9033
200          #33.3925       34.6600
500          #33.0980       34.4500

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.