# Floating-point representation

David M. Warme David at Warme.net
Wed May 19 13:18:33 UTC 2021

```Paul's idea of making the least significant bit be implicit
would be reasonable in the case where the mantissa has
fixed precision, but not when the mantissa is mpz_t, where
dealing with the missing 1 bit provokes much unnecessary
pain.

I use these in a polynomial real root-finding code that
uses Sturm sequences.  I scale the initial floating-point
polynomial coefficients to integers and remove any common
integer factor.  To avoid the difficulties of multiple
roots, it then uses Yun's algorithm to compute a square-free
factorization, with the added bonus of correctly determining
the exact multiplicity of each real root.

The Sturm sequence (of each square-free factor) can then be
expressed as exact polynomials in Z[x].  This special
floating-point representation is then very convenient when
evaluating each of the polynomials of the Sturm sequence at
any particular real value x during the search phase.  Only
add and multiply are needed, and there is no rounding or
other numeric error of any kind -- the correct sign of each
polynomial value is guaranteed.

The search partitions the real line: (-infinity,-1), -1,
(-1,0), 0, (0, 1), 1, (1, +infinity).  Within each range,
it first uses binary search on the exponent, and then on
the mantissa.  Once a root has been isolated, it is refined
using Newton's method, correcting with a few iterations of
bisection whenever Newton misbehaves (e.g., by leaving the
isolating interval, which shrinks as Newton / bisection
proceed).

This is all prefaced by a purely hardware floating-point
implementation that uses "long double" interval arithmetic
to assure that there is no ambiguity regarding the sign of
any Sturm sequence coefficient or value.  The code switches
to the "mpz_t" version if any of these problems manifest.

Works like a charm, and the amortized computational cost
is quite good.  The accuracy, of course, is superb.  The
mpz_t version guarantees the correctly rounded
"long double" root.  The hardware floating-point version
essentially always finds the value x such the |f(x)| is
the smallest possible "long double" result.  (What this
means regarding the true accuracy of x is, of course,
"complicated".)

On 5/18/21 5:53 AM, Marc Glisse wrote:
> On Thu, 13 May 2021, David M. Warme wrote:
>
>> Consider the floating-point representation
>>
>>    m * b**e
>>
>> where m, b and e are integers and b >= 2 is a fixed constant.
>> (The mantissa m could be a GMP integer, and e a fixed-precision
>> signed integer.)
>>
>> The normalization rule is that the mantissa m is either zero
>> or not divisible by b.  (For the usual case of b = 2, the
>> mantissa m is either zero or an odd integer.)
>>
>> This representation supports add, subtract and multiply with
>> no rounding error.
>>
>> Q1: Does this representation have a well-defined name (especially
>>    for the b = 2 case)?
>>
>> Q2: If so, does anyone have a reference in the literature?
>
> No idea about references, but I think that's what I implemented in
> CGAL as CGAL::Mpzf, with b = 2**64, where we already had CGAL::Gmpzf
> with b=2.
>
```