hgcd1/2

Mon Sep 16 10:28:22 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  > I've written several div1 in asm (arm v5 method 2, 64-bit arm v8 method
  > 2 and 3, 64-bit x86 method 2).  

  Nice.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: x64-div_11-m2.asm
Type: application/octet-stream
Size: 2360 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190916/5f4f02ef/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm32-div_11-m2.asm
Type: application/octet-stream
Size: 1517 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190916/5f4f02ef/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm64-div_11-m2.asm
Type: application/octet-stream
Size: 1613 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190916/5f4f02ef/attachment-0007.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm64-div_11-m3.asm
Type: application/octet-stream
Size: 1741 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190916/5f4f02ef/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64-div_11-m2.asm
Type: application/octet-stream
Size: 2114 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190916/5f4f02ef/attachment-0009.obj>
-------------- next part --------------

  > I now believe a table-based thing which gets the quotient right > 90% of
  > the time will be the best approach.

  What should the table look like? If one looks up only the divisor bits,
  (i.e., a reciprocal table) one gets better precision with smaller table
  size. If one looks up bits of both inputs, table gets larger. One would
  then tabulate something like 1uuuuu / 1.ddddd, and shift the result
  depending on the bit size difference.

Both variants might be considered.  It would seem that getting a
quotient directly, i.e., indexing with bits from both dividend and
divisor would be fastest.  But forming such an index might take some
cycles extra compared to indexing with just the divisor.  The extra
multiply needed when indexing with just the divisor might be faster.

We cannot expect the quotient to close to correct when n / d is large,
i.e., a single adjustment step will not be sufficient.  We need to
detect that (perhaps by comparing the count_leading_zeros of the
dividend and divisor) and do a full /.  That'll be rare, so mispredicton
penalty will be low.

-- 
Torbjörn
Please encrypt, key id 0xC8601622