More div_qr_2 assembler

Wed Mar 30 23:35:00 CEST 2011

I've hacked up an mpn_div_qr_2u_pi1 for the unnormalized case, based on
the earlier mpn_div_qr_2n_pi1, which handles the normalized case. It is
attached below.

Results on my core2 laptop, in cycles per limb (-C):

  $ ./speed -o cycles-broken -p 100000 -C -s3-10,50,200,500,1500 mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
  clock_gettime is 1.000ns accurate
  overhead 6.04 cycles, precision 100000 units of 1.00e-09 secs, CPU freq 1200.00 MHz
           mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
  3            #36.0929       41.9874       58.5886
  4            #35.9422       39.5447       51.7249
  5            #35.0578       38.5444       47.5463
  6            #36.0060       37.4358       45.4954
  7            #35.1112       36.6949       43.4555
  8            #34.5703       36.2022       42.9322
  9            #34.5919       36.1517       41.4659
  10           #34.7529       37.0281       40.9467
  50            34.5641      #33.4350       35.3366
  200           34.3937      #33.0381       34.3683
  500           34.2357      #32.8356       33.4413
  1500          34.0592      #30.2301       30.9165

The same functions in total cycles (-c) at size 3:

  $ ./speed -o cycles-broken -p 100000 -c -s3 mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
  clock_gettime is 1.000ns accurate
  overhead 6.04 cycles, precision 100000 units of 1.00e-09 secs, CPU freq 1200.00 MHz
           mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
  3             #109.79        125.96        172.92

So for the unnormalized case, we get an additional constant overhead of
about 47 cycles (172.92 - 125.96 at size 3). The main cost here is the
more complicated handling of qh, which may be almost a full limb. On top
of that, the normalization seems to cost roughly a cycle per quotient
limb. Loopmixing both the normalized and unnormalized loops might be
worthwhile.

The current code uses shld (and shrd at the end). This means that the
normalized function could set the shift count to zero and jump into the
unnormalized loop after having determined qh, reducing code size a bit.

But since shld is slow on some processors, we'd need code without shld
(depending on the SHLD_SLOW configure variable). The usual shl/shr/or
replacement needs a right shift by 64 - cnt, and then it gets harder
to handle a zero shift count.
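
To make the zero-count problem concrete, here is a rough C model. It is
only a sketch and not taken from the attached assembly; all function
names are made up for illustration. shld_ref describes the value shld
leaves in the high limb, shift_pair_x86 models a plain shl/shr/or
sequence with the 6-bit count masking that x86 applies to 64-bit shifts,
and shift_pair_split is one possible branch-free workaround (my
assumption, there may well be better ones) that pays an extra shift.

```c
#include <assert.h>
#include <stdint.h>

/* Reference for what shld leaves in the high limb: hi shifted left by
   cnt (0 <= cnt <= 63), filled from the top bits of lo. */
static uint64_t
shld_ref (uint64_t hi, uint64_t lo, unsigned cnt)
{
  return cnt == 0 ? hi : (hi << cnt) | (lo >> (64 - cnt));
}

/* Model of the plain shl/shr/or replacement. x86 masks 64-bit shift
   counts to 6 bits, so the counts are masked here as well; that also
   keeps the C expression well defined. For cnt == 0 the right shift
   becomes a shift by 0, and the result is hi | lo instead of hi. */
static uint64_t
shift_pair_x86 (uint64_t hi, uint64_t lo, unsigned cnt)
{
  return (hi << (cnt & 63)) | (lo >> ((64 - cnt) & 63));
}

/* One possible branch-free workaround (an assumption, not taken from
   the attached code): split the right shift in two, so that each
   count stays in 0..63. Costs one extra shift. */
static uint64_t
shift_pair_split (uint64_t hi, uint64_t lo, unsigned cnt)
{
  return (hi << cnt) | ((lo >> 1) >> (63 - cnt));
}

/* Hypothetical self-check: asserts that the split variant agrees with
   the reference for every count, and returns how many counts in 0..63
   the naive model gets wrong for the given inputs. */
static int
count_naive_mismatches (uint64_t hi, uint64_t lo)
{
  int bad = 0;
  for (unsigned cnt = 0; cnt < 64; cnt++)
    {
      assert (shift_pair_split (hi, lo, cnt) == shld_ref (hi, lo, cnt));
      if (shift_pair_x86 (hi, lo, cnt) != shld_ref (hi, lo, cnt))
        bad++;
    }
  return bad;
}
```

With hi = 1 and lo = 2, shld_ref and shift_pair_split both give 1 for
cnt = 0, while shift_pair_x86 gives 3. For nonzero counts all three
agree, so only the zero count needs special treatment.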

As Torbjörn has observed, the current interface where qh is returned
separately leads to a bit of code duplication in the unnormalized case.

/nisse

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: div_qr_2u_pi1.asm
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20110330/c7f256a9/attachment.ksh>

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.