Floating-point round trips

I often need to display or record output of floating-point values, in a human-readable decimal format. This is easy enough in most programming languages… unless you need the representation to be one-to-one, so that distinct finite values always produce distinct output. More precisely, we would like to be able to (1) convert a floating-point value $x$ to a string, then (2) convert the string back to floating-point, recovering exactly the original value $x$ .

For example, I use MATLAB more than I care to, and despite its primary application being numeric scientific computing, none of its built-in output formats support this. The simplest way that I know of to do this is actually via MATLAB’s interface to Java:

>> java.lang.Double.toString(1.1 + 2.2 - 3.3)

In C++, you can set the precision of an output stream to a desired number of decimal digits. In C++11, the constant max_digits10 is defined just for this purpose:

#include <sstream> // ostringstream
#include <iomanip> // setprecision
#include <limits> // numeric_limits

template<typename T>
std::string to_string(T x)
{
    std::ostringstream os;
    os << std::setprecision(std::numeric_limits<T>::max_digits10) << x;
    return os.str();
}

Unfortunately, I learned the hard way that Visual Studio 2010 gets this wrong, at least for single-precision float, defining max_digits10 in that case to be 8 instead of 9 as it should be.

For compatibility with older compilers– either with bugs like VS 2010, or simply pre-C++11– the good news is that we can compute the required number of digits $d$ , as a function of the number of significant bits $b$ in the underlying floating-point representation (which is in turn defined in the standard header <limits> as numeric_limits<T>::digits):

$d = \left \lceil{b \log_{10}(2)}\right \rceil + 1$

where $b=24$ for float and $b=53$ for double, yielding $d=9$ and $d=17$ , respectively. (Since I can’t seem to write a post without including an obligatory puzzle: why is the +1 needed in the above equation?)

Interestingly, the following alternative formula also works, with the added benefit that it can be evaluated at compile time:


max_digits10 = std::numeric_limits<T>::digits * 3010 / 10000;

I’m not sure why this particular approximation of $\log_{10}(2)$ is used, since 301/1000 works equally well; both are correct for any number of bits less than 196.

(Edit: The last time I wrote about floating-point issues, I mentioned Bruce Dawson’s Random ASCII blog, in particular the excellent series of articles about floating point. I should have done so here as well. In this case, the specific relevant article is here… but instead of the brute force enumeration approach described there explaining the “+1” in the formula above– and how often it is actually needed– my motivation for this post, in the form of the “puzzle,” was to suggest a pencil-and-paper derivation of that formula, which would in turn yield a direct calculation of how often that +1 is needed.)

3 thoughts on “Floating-point round trips”

skeetodev on June 21, 2015 at 8:24 pm said:

I believe this covers the case for Matlab: sprintf(‘%.16e’, x). You can use scanf(str, ‘%f’) to read it back in to the exact same value. If it relies on the underling Java routines (rather than, say, MSVCRT on Windows), it should be pretty reliable.

The same is true for conforming C/C++ implementations using almost the same printf() specifier and strtod(). Unfortunately, there are many broken strtod()’s out there, including, as you noticed, pretty much anything produced by Microsoft.

http://stackoverflow.com/a/19897395

http://www.exploringbinary.com/visual-c-plus-plus-strtod-still-broken/

That Exploring Binary blog has lots of discussion on this topic since he’s probably one of the leading experts.

Reply ↓
- possiblywrong on June 21, 2015 at 8:51 pm said:
  
  I suppose I should have more explicitly added something like “efficient” to the list of qualifiers that are harder-than-expected to achieve in MATLAB. By that I mean that the usual C-style format specification you describe *always* includes those additional 16 digits after the decimal point, even when they aren’t necessary. For example, we would like to display the integral value 1 as, well, 1, or at worst 1.0, not 1.0000000000000000e+000 :). In C++, setprecision() achieves this, and in MATLAB, “format long g” is my default output format, which “almost” always does the right thing… and it’s that “almost” that is frustrating.
  
  You also make a good point that we are depending on an accurate conversion back in the other direction (which is the main focus of the Exploring Binary post).
  
  Reply ↓
Pingback: Floating-point round trips revisited | Possibly Wrong

Possibly Wrong

On science, mathematics, and computing

Floating-point round trips

3 thoughts on “Floating-point round trips”

Leave a comment Cancel reply

Share this:

Related

3 thoughts on “Floating-point round trips”

Leave a comment Cancel reply