Floating-point round trips

I often need to display or record output of floating-point values, in a human-readable decimal format.  This is easy enough in most programming languages… unless you need the representation to be one-to-one, so that distinct finite values always produce distinct output.  More precisely, we would like to be able to (1) convert a floating-point value x to a string, then (2) convert the string back to floating-point, recovering exactly the original value x.

For example, I use MATLAB more than I care to, and despite its primary application being numeric scientific computing, none of its built-in output formats support this.  The simplest way that I know of to do this is actually via MATLAB’s interface to Java:

>> java.lang.Double.toString(1.1 + 2.2 - 3.3)

In C++, you can set the precision of an output stream to a desired number of decimal digits.  In C++11, the constant max_digits10 is defined just for this purpose:

#include <sstream> // ostringstream
#include <iomanip> // setprecision
#include <limits> // numeric_limits

template<typename T>
std::string to_string(T x)
    std::ostringstream os;
    os << std::setprecision(std::numeric_limits<T>::max_digits10) << x;
    return os.str();

Unfortunately, I learned the hard way that Visual Studio 2010 gets this wrong, at least for single-precision float, defining max_digits10 in that case to be 8 instead of 9 as it should be.

For compatibility with older compilers– either with bugs like VS 2010, or simply pre-C++11– the good news is that we can compute the required number of digits d, as a function of the number of significant bits b in the underlying floating-point representation (which is in turn defined in the standard header <limits> as numeric_limits<T>::digits):

d = \left \lceil{b \log_{10}(2)}\right \rceil + 1

where b=24 for float and b=53 for double, yielding d=9 and d=17, respectively.  (Since I can’t seem to write a post without including an obligatory puzzle: why is the +1 needed in the above equation?)

Interestingly, the following alternative formula also works, with the added benefit that it can be evaluated at compile time:

max_digits10 = std::numeric_limits<T>::digits * 3010 / 10000;

I’m not sure why this particular approximation of \log_{10}(2) is used, since 301/1000 works equally well; both are correct for any number of bits less than 196.

(Edit: The last time I wrote about floating-point issues, I mentioned Bruce Dawson’s Random ASCII blog, in particular the excellent series of articles about floating point.  I should have done so here as well.  In this case, the specific relevant article is here… but instead of the brute force enumeration approach described there explaining the “+1” in the formula above– and how often it is actually needed– my motivation for this post, in the form of the “puzzle,” was to suggest a pencil-and-paper derivation of that formula, which would in turn yield a direct calculation of how often that +1 is needed.)


This entry was posted in Uncategorized. Bookmark the permalink.

3 Responses to Floating-point round trips

  1. skeetodev says:

    I believe this covers the case for Matlab: sprintf(‘%.16e’, x). You can use scanf(str, ‘%f’) to read it back in to the exact same value. If it relies on the underling Java routines (rather than, say, MSVCRT on Windows), it should be pretty reliable.

    The same is true for conforming C/C++ implementations using almost the same printf() specifier and strtod(). Unfortunately, there are many broken strtod()’s out there, including, as you noticed, pretty much anything produced by Microsoft.



    That Exploring Binary blog has lots of discussion on this topic since he’s probably one of the leading experts.

    • I suppose I should have more explicitly added something like “efficient” to the list of qualifiers that are harder-than-expected to achieve in MATLAB. By that I mean that the usual C-style format specification you describe *always* includes those additional 16 digits after the decimal point, even when they aren’t necessary. For example, we would like to display the integral value 1 as, well, 1, or at worst 1.0, not 1.0000000000000000e+000 :). In C++, setprecision() achieves this, and in MATLAB, “format long g” is my default output format, which “almost” always does the right thing… and it’s that “almost” that is frustrating.

      You also make a good point that we are depending on an accurate conversion back in the other direction (which is the main focus of the Exploring Binary post).

  2. Pingback: Floating-point round trips revisited | Possibly Wrong

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.