## Floating-point round trips revisited

After a couple of interesting conversations on the subject, I think I should try to clear up some potentially confusing comments I made in my last post about converting binary floating-point values to decimal strings… and back to binary floating point again.

Injection != round-trip identity

“… You need the representation to be one-to-one, so that distinct finite values always produce distinct output.  More precisely, we would like to be able to (1) convert a floating-point value $x$ to a string, then (2) convert the string back to floating-point, recovering exactly the original value $x$.”

In the above quote, “more precisely” is a misleading choice of words, since it suggests that these two properties (one-to-one-ness and round-tripping) are equivalent, when they are not.  For a simplest possible example, consider the correctly rounded floating-point conversion from single-bit binary to single-digit decimal.  This conversion is one-to-one: distinct powers of two always map to distinct single-digit multiples of powers of 10.  However, the round-trip conversion, from binary to decimal and back to correctly rounded decimal, is not the identity.  It’s an exercise for the reader to verify that the value $2^{27}$, used by Matula in Reference (2) below, is a minimal counterexample.

To see why this distinction matters, consider the more general problem of the correctly rounded conversion from $n$-digit base $\beta$ to $m$-digit base $\nu$… where $\beta$ and $\nu$ are not “variants of a common base” (more on this later).  Then as shown in Reference (1), the resulting function is one-to-one if and only if

$\nu^{m-1} \geq \beta^n - 1$

However, we want more than the ability to simply invert this one-to-one conversion function: instead, we would like to be able to compose this conversion function with another correctly rounded conversion back in the other direction, and yield the identity.  As shown in Reference (2), this composition yields the identity if and only if the following stronger condition holds:

$\nu^{m-1} > \beta^n$

Fortunately, in the case where $\beta=2$ and $\nu=10$, the single-bit-to-single-digit wrinkle mentioned above (i.e., $n=m=1$) is the only situation where these two conditions are not equivalent, and so in practice when using floats and doubles this distinction is less important.

Variants of a common base

It’s another exercise for the reader to verify that the second inequality above may be solved for a minimal number $m$ of decimal digits required for a successful round-trip conversion from $n$-bit binary, to yield the formula from the last post:

$m_{min} = \lceil{n \log_{10}(2)}\rceil + 1$

However, there is an important detail involved in turning that strict inequality into this ceiling expression, that is left out of many discussions that I see online (at least where an attempt is made to generalize beyond binary-decimal conversions): this only works since $n \log_{10}(2)$ is not an integer.  That is, base 2 and base 10 are not “variants [i.e., powers] of a common base,” as discussed in the references below.  For conversions between bases that are powers of some common base, these formulas do not hold in general.  (Consider various conversions between binary and hexadecimal, for example, again as discussed in the references below.)

References:

1. Matula, David W., The Base Conversion Theorem, Proceedings of the American Mathematical Society, 19(3) June 1968, p. 716-723 [PDF]
2. Matula, David W., In-and-Out Conversions, Communications of the ACM, 11(1) January 1968, p. 47-50 [ACM]

This entry was posted in Uncategorized. Bookmark the permalink.

### One Response to Floating-point round trips revisited

This site uses Akismet to reduce spam. Learn how your comment data is processed.