More div_qr_2 assembler

Torbjorn Granlund tg at gmplib.org
Thu Mar 31 09:13:14 CEST 2011


nisse at lysator.liu.se (Niels Möller) writes:

    $ ./speed -o cycles-broken -p 100000 -C -s3-10,50,200,500,1500 mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
    clock_gettime is 1.000ns accurate
    overhead 6.04 cycles, precision 100000 units of 1.00e-09 secs, CPU freq 1200.00 MHz
             mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
    3            #36.0929       41.9874       58.5886
    4            #35.9422       39.5447       51.7249
    5            #35.0578       38.5444       47.5463
    6            #36.0060       37.4358       45.4954
    7            #35.1112       36.6949       43.4555
    8            #34.5703       36.2022       42.9322
    9            #34.5919       36.1517       41.4659
    10           #34.7529       37.0281       40.9467
    50            34.5641      #33.4350       35.3366
    200           34.3937      #33.0381       34.3683
    500           34.2357      #32.8356       33.4413
    1500          34.0592      #30.2301       30.9165
  
  $ ./speed -o cycles-broken -p 100000 -c -s3 mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
  clock_gettime is 1.000ns accurate
  overhead 6.04 cycles, precision 100000 units of 1.00e-09 secs, CPU freq 1200.00 MHz
           mpn_divrem_2 mpn_div_qr_2n mpn_div_qr_2u
  3             #109.79        125.96        172.92
  
  So for the unnormalized case, we get an additional constant overhead of
  47 cycles (the main cost here is the more complicated handling of qh,
  which may be almost a full limb), and then the normalization seems to
  cost a cycle per quotient limb. Loopmixing both the normalized and
  unnormalized loops might be worthwhile.
  
How can you time all three using the same speed command, given that the
2u variant doesn't accept the same divisors as the other two?

To fairly compare divrem_2--now used for all divisors, but with
preshifting of the operands--to the new shift-on-the-fly code, I think
we should include the time for shifting.  Perhaps add a new tune/speed
target that does exactly that?

The slowdown for normalised operands is more worrying.  Except that in
this case I primarily worry about the high-end processors for GMP,
i.e., AMD K8-K10 and, to a slightly lesser extent, Intel Sandy Bridge.

  Current code uses shld (and shrd at the end). This means that the
  normalized function could set the shift count to zero and jump to the
  unnormalized loop after having determined qh, reducing code size a bit.
  
  But since SHLD is slow on some processors, we'd need code without shld
  (depending on the SHLD_SLOW configure variable), and then it gets harder
  to handle a zero shift count.
  
SHLD_SLOW really means "SHLD_SUPERSLOW"; some machines, e.g., AMD
K8-K10, have slow but not superslow SHLD.  (Of the 64-bit processors,
Intel's Atom and VIA's Nano have superslow SHLD.)

-- 
Torbjörn

