Serializing MATLAB data

Consider the following problem: given a value in the MATLAB programming language, can we serialize it into a sequence of bytes– suitable for, say, storage on disk– in a form that allows easy recovery of the exact original value?

Although I will eventually try to provide an actual solution, the primary motivation for this post is simply to point out some quirks and warts of the MATLAB language that make this problem surprisingly difficult to solve.

“Binary” serialization

Our problem requires a bit of clarification, since there are at least a couple of different reasonable use cases.  First, if we can work with a stream of arbitrary opaque bytes– for example, if we want to send and receive MATLAB data on a TCP socket connection– then there is actually a very simple and robust built-in solution… as long as we’re comfortable with undocumented functionality.  The function b=getByteStreamFromArray(v) converts a value to a uint8 array of bytes, and v=getArrayFromByteStream(b) converts back.  This works on pretty much all types of data I can think of to test, even Java- and user-defined class instances.
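As a minimal sketch of that round trip (the sample value v is made up for illustration, and since both functions are undocumented, their behavior may differ between releases):

```matlab
% Hypothetical sample value: a struct holding a cell array of mixed types.
v = struct('name', 'example', 'data', {{int8(5), single(2.5), [1 2; 3 4]}});

b = getByteStreamFromArray(v);   % undocumented: value -> uint8 byte array
w = getArrayFromByteStream(b);   % undocumented: byte array -> value

disp(class(b));                  % the byte stream is a uint8 array
disp(isequaln(v, w));            % the recovered value compares equal
disp(class(w.data{1}));          % element types survive the trip
```

The byte array b is opaque, so this suits storage or transmission, not anything a human would want to read or edit.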

Text serialization

But what if we would like something human-readable (and thus potentially human-editable)?  That is, we would like a function similar to Python’s repr, that converts a value to a char string representation, so that eval(repr(v)) “equals” v.  (I put “equals” in quotes because even testing such a function is hard to do in MATLAB.  I suppose the built-in function isequaln is the closest approximation to what we’re looking for, but it ignores type information, so that isequaln(int8(5), single(5)) returns true, for example.)
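To make the isequaln caveat concrete, here is a small sketch of a stricter comparison.  The helper name sameClassAndValue is my own invention for illustration, and it only checks the class at the top level; values nested inside cell or struct arrays would still be compared without regard to type.

```matlab
a = int8(5);
b = single(5);

% isequaln compares values only, so differing numeric classes still match.
loose = isequaln(a, b);

% Hypothetical helper: additionally require the top-level classes to agree.
sameClassAndValue = @(x, y) strcmp(class(x), class(y)) && isequaln(x, y);
strict = sameClassAndValue(a, b);

fprintf('isequaln: %d, with class check: %d\n', loose, strict);
```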

Without further ado, following is my attempt at such an implementation, to use as you wish:

function s = repr(v)
%REPR Return string representation of value such that eval(repr(v)) == v.
%
%   Class instances, NaN payloads, and function handle closures are not
%   supported.

    if isstruct(v)
        s = sprintf('cell2struct(%s, %s)', ...
            repr(struct2cell(v)), repr(fieldnames(v)));
    elseif isempty(v)
        sz = size(v);
        if isequal(sz, [0, 0])
            if isa(v, 'double')
                s = '[]';
            elseif ischar(v)
                s = '''''';
            elseif iscell(v)
                s = '{}';
            else
                s = sprintf('%s([])', class(v));
            end
        elseif isa(v, 'double')
            s = sprintf('zeros(%s)', mat2str(sz, 17));
        elseif iscell(v)
            s = sprintf('cell(%s)', mat2str(sz, 17));
        else
            s = sprintf('%s(zeros(%s))', class(v), mat2str(sz, 17));
        end
    elseif ~ismatrix(v)
        nd = ndims(v);
        s = sprintf('cat(%d, %s)', nd, strjoin(cellfun(@repr, ...
            squeeze(num2cell(v, 1:(nd - 1))).', ...
            'UniformOutput', false), ', '));
    elseif isnumeric(v)
        if ~isreal(v)
            s = sprintf('complex(%s, %s)', repr(real(v)), repr(imag(v)));
        elseif isa(v, 'double')
            s = strrep(repr_matrix(@arrayfun, ...
                @(x) regexprep(char(java.lang.Double.toString(x)), ...
                '\.0$', ''), v, '[%s]', '%s'), 'inity', '');
        elseif isfloat(v)
            s = strrep(repr_matrix(@arrayfun, ...
                @(x) regexprep(char(java.lang.Float.toString(x)), ...
                '\.0$', ''), v, '[%s]', 'single(%s)'), 'inity', '');
        elseif isa(v, 'uint64') || isa(v, 'int64')
            t = class(v);
            s = repr_matrix(@arrayfun, ...
                @(x) sprintf('%s(%s)', t, int2str(x)), v, '[%s]', '%s');
        else
            s = mat2str(v, 'class');
        end
    elseif islogical(v) || ischar(v)
        s = mat2str(v);
    elseif iscell(v)
        s = repr_matrix(@cellfun, @repr, v, '%s', '{%s}');
    elseif isa(v, 'function_handle')
        s = sprintf('str2func(''%s'')', func2str(v));
    else
        error('Unsupported type.');
    end
end

function s = repr_matrix(map, repr_scalar, v, format_matrix, format_class)
    s = strjoin(cellfun(@(row) strjoin(row, ', '), ...
        num2cell(map(repr_scalar, v, 'UniformOutput', false), 2).', ...
                                     'UniformOutput', false), '; ');
    if ~isscalar(v)
        s = sprintf(format_matrix, s);
    end
    s = sprintf(format_class, s);
end

That felt like a lot of work… and that’s only supporting the “plain old data” types: struct and cell arrays, function handles, logical and character arrays, and the various floating-point and integer numeric types.  As the help indicates, Java and classdef instances are not supported.  A couple of other cases are only imperfectly handled as well, as we’ll see shortly.

Struct arrays

The code starts with struct arrays.  The tricky issue here is that struct arrays can not only be “empty” in the usual sense of having zero elements, but also– independently of whether they are empty– they can have no fields.  It turns out that the struct constructor, which would work fine for “normal” structures with one or more fields, has limited expressive power when it comes to field-less struct arrays: unless the size is 1×1 or 0x0, some additional concatenation or reshaping is required.  Fortunately, cell2struct handles all of these cases directly.
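For example, here is a sketch of the field-less case, using a 2×3 array built with repmat as the test subject.  The cell2struct call has the same form that repr emits for such a value: an empty cell array whose trailing dimensions carry the struct array’s shape, plus an empty list of field names.

```matlab
% A 2x3 struct array with no fields at all.
s = repmat(struct(), 2, 3);

% The same shape rebuilt in one expression: the 0x2x3 cell array has zero
% "rows" (one per field), and the remaining dimensions give the size.
t = cell2struct(cell([0 2 3]), cell([0 1]));

disp(size(t));                   % same size as s
disp(isempty(fieldnames(t)));    % and still no fields
```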

Multi-dimensional arrays

Next, after handling the tedious cases of empty arrays of various types, the ~ismatrix(v) test handles multi-dimensional arrays– that is, arrays with more than 2 dimensions.  I could have handled this with reshape instead, but I think this recursive concatenation approach does a better job of preserving the “visual shape” of the data.
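As a small illustration of the two options (the 2x2x2 array here is just sample data), the recursive approach slices the array into “pages” along its last dimension and glues them back together with cat:

```matlab
v = reshape(1:8, [2 2 2]);       % sample 3-D array, filled column-major

% The reshape route would emit a flat list of elements plus a size.
% The cat route keeps each 2-D page recognizable:
w = cat(3, [1 3; 2 4], [5 7; 6 8]);
disp(isequal(v, w));

% Slicing into pages, as repr does before recursing on each one:
nd = ndims(v);
pages = squeeze(num2cell(v, 1:(nd - 1))).';
disp(pages{2});                  % the second page, [5 7; 6 8]
```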

In the process of testing this, I learned something interesting about multi-dimensional arrays: they can’t have trailing singleton dimensions!  That is, there are 1×1 arrays, and 2×1 arrays, even 1x2x3 and 2x1x3 arrays… but no matter how hard I try, I cannot construct an mxnx1 array, or an mxnxkx1 array, etc.  MATLAB seems to always “squeeze” trailing singleton dimensions automagically.
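A few quick experiments show the effect (the variable names are arbitrary):

```matlab
a = zeros(2, 3, 1);              % ask for a trailing singleton dimension...
disp(ndims(a));                  % ...but only 2 dimensions remain
disp(size(a));                   % [2 3]

b = zeros(2, 1, 3);              % an interior singleton is preserved
disp(ndims(b));                  % 3

c = reshape(1:6, [2 3 1 1]);     % reshape cannot force it, either
disp(ndims(c));                  % 2

disp(size(a, 3));                % asking for a dimension past the end gives 1
```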

Numbers

The isnumeric(v) section is what makes this problem almost comically complicated.  There are 10 different numeric types in MATLAB: double and single precision floating point, and signed and unsigned 8-, 16-, 32-, and 64-bit integers.  Serializing arrays of these types should be the job of the built-in function mat2str, which we do lean on here, but only for the shorter integer types, since it fails in several ways for the other numeric types.

First, the nit-picky stuff: I should emphasize that my goal is “round-trip” reproducibility; that is, after converting to string and back, we want the underlying bytes representing the numeric values to be unchanged.  Precision is one issue: for some reason, MATLAB’s default seems to be 15 decimal digits, which isn’t enough– by two– to accurately reproduce all double precision values.  Granted, this is an optional argument to mat2str, which effectively uses sprintf('%.17g',x) under the hood, but Java’s algorithm does a better job of limiting the number of digits that are actually needed for any given value.
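Here is a short sketch of the difference, using 0.1 + 0.2 and 0.1 as sample values.  The Java calls assume MATLAB is running with its JVM enabled.

```matlab
x = 0.1 + 0.2;

s15 = sprintf('%.15g', x);       % '0.3': too short to recover x
s17 = sprintf('%.17g', x);       % '0.30000000000000004': round-trips
disp(str2double(s15) == x);      % false
disp(str2double(s17) == x);      % true

% 17 digits is always enough, but often more than necessary:
y = 0.1;
disp(sprintf('%.17g', y));                    % 0.10000000000000001
disp(char(java.lang.Double.toString(y)));     % 0.1

% Java picks a short string that still round-trips.
sj = char(java.lang.Double.toString(x));
disp(str2double(sj) == x);
```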

Other reasons to bypass mat2str are that (1) for some reason it explicitly “erases” negative zero, and (2) it still doesn’t quite accurately handle complex numbers involving NaN, although it has improved in recent releases.  Witness eval(mat2str(complex(0, nan))), for example.  (My implementation isn’t perfect here, either, though; there are multiple representations of NaN, but this function strips any payload.)
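The negative zero case is easy to check for yourself.  In this sketch, isNegZero is a hypothetical helper that detects the sign bit by dividing, and the Java call again assumes the JVM is available; the result of the mat2str path may depend on your release.

```matlab
% Hypothetical helper: -0 compares equal to 0, but 1/(-0) is -Inf.
isNegZero = @(x) (x == 0) && (1 / x < 0);

z = -0;
disp(isNegZero(z));              % true

% Round trip through mat2str (the sign is lost in the releases I tested):
viaMat2str = eval(mat2str(z));
disp(isNegZero(viaMat2str));

% Round trip through Java's string form, trimmed as in repr:
sj = regexprep(char(java.lang.Double.toString(z)), '\.0$', '');
viaJava = eval(sj);
disp(isNegZero(viaJava));        % the sign survives
```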

But MATLAB’s behavior with 64-bit integer types is the most interesting of all, I think.  Imagine things from the parser’s perspective: any numeric literal defaults to double precision, which, without a decimal point or fractional part, we can think of as “almost” an int54.  There is no separate syntax for integer literals; construction of “literal” values of the shorter (8-, 16-, and 32-bit) integer types effectively casts from that double-precision literal to the corresponding integer type.

But for uint64 and int64, this doesn’t work… and for a while (until around R2010a), it really didn’t work– there was no way to directly construct a 64-bit integer larger than 2^53 unless it happened to be exactly representable as a double (a power of two, for example)!

This behavior has been improved somewhat since then, but at the expense of added complexity in the parser: the expression [u]int64(expr) is now a special case, as long as expr is an integer literal, with no arithmetic, imaginary part, etc.  Even so much as a unary plus will cause a fall back to the usual cast-from-double.  (It appears that Octave, at least as of version 4.0.3, has not yet worked this out.)

The effect on this serialization function is that we have to wrap that explicit uint64 or int64 construction around each individual integer scalar, instead of a single cast of the entire array expression as we can do with all of the other numeric types.
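A sketch of what this looks like in practice, using 2^53 + 1 (the smallest positive integer that a double cannot represent) as the test value.  This assumes a release recent enough to have the literal special case; the variable names are arbitrary.

```matlab
a = int64(9007199254740993);     % bare literal: parsed as a 64-bit integer
b = int64(+9007199254740993);    % unary plus: falls back to cast-from-double
c = int64([9007199254740993 1]); % array expression: elements are doubles first

% So each element needs its own wrapper, which is what repr emits:
d = [int64(9007199254740993), int64(1)];

fprintf('%d\n', a);              % 9007199254740993
fprintf('%d\n', b);              % rounded to 9007199254740992
disp(c(1) == d(1));              % false: the array cast lost the low bit
```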

Function handles

Finally, function handles are also special.  First, they must be scalar (i.e., 1×1), most likely due to the language syntax ambiguity between array indexing and function application.  But function handles also can have workspace variables associated with them– usually when created anonymously– and although an existing function handle and its associated workspace can be inspected, there does not appear to be a way to create one from scratch in a single evaluatable expression.
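Here is a brief sketch of both situations.  The handle to sin and the multiplier k are just sample data; the point is that func2str keeps the text of an anonymous function but not the values it captured.

```matlab
% A handle to a named function round-trips through its name.
f = @sin;
g = str2func(func2str(f));
disp(g(0));                      % 0

% An anonymous function that captures a workspace variable:
k = 2;
h = @(x) k * x;
info = functions(h);
disp(info.workspace{1}.k);       % the captured value can be inspected: 2

% But the text alone does not carry that value along.
h2 = str2func(func2str(h));
try
    disp(h2(3));
catch err
    disp(err.message);           % expect a complaint that k is undefined
end
```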

 

7 thoughts on “Serializing MATLAB data”

  1. Oh boy, this is an area I’ve also extensively researched myself!

    > This behavior has been improved somewhat since then, but at the expense of added complexity in the parser: the expression [u]int64(expr) is now a special case, as long as expr is an integer literal, with no arithmetic, imaginary part, etc.

    Would you believe I’m the one who, almost a year ago, started this conversation on the Octave bug tracker? 🙂 It’s still an active discussion, should you want to weigh in. It’s another great example of Matlab’s ad hoc language design.

    http://savannah.gnu.org/bugs/?45945

    The official documentation for the .mat format (level 5) is very revealing on the nature of Matlab’s types. I believe it’s essentially a (neat and tidy) memory dump, or at least it used to be at some point. For example, it wasn’t until I read this that I realized Matlab doesn’t actually have a scalar type. Everything, even what appears to be a scalar, is really a matrix/array of at least two dimensions, even if it’s 1×1, 0x0, or 0x100. This is related to the dimension squeezing you’re seeing.

    matfile_format.pdf

    As for closures, I honestly don’t think they’re worth trying to serialize. If they close over variables, you won’t be able to get to their values, especially because they’re not necessarily workspace variables. You also won’t know it actually closes over a variable unless you parse the function string and search for free variables in the AST, which is obviously very non-trivial.

    Here are my thoughts on this matter in Emacs Lisp:

    http://nullprogram.com/blog/2013/12/30/

    And when I did it for JavaScript, I completely punted on serializing closures/functions. It just throws an error and quits:

    http://nullprogram.com/blog/2013/03/28/

    Fortunately for you, Matlab doesn’t support circular data structures. One less thing to worry about.

  2. I forgot to add one more fun fact: Matlab doesn’t strictly enforce well-formedness when loading .mat files. You can (ab)use this by crafting your own .mat files that load oddly structured values that cannot otherwise be constructed from within the language itself. This *may* include matrices with trailing singleton dimensions, but I’d have to test it out.

    • Thanks for the link to the .mat documentation, I don’t think I had seen that before. Looking through it, I don’t see anything explicit (in section 1-17) about trailing singleton dimensions, so I’m not sure where the squeezing is happening. For example, I have tried doing this from a MEX function with mxCreateNumericArray(3, {2,2,1}, mxDOUBLE_CLASS, mxREAL)… but even before returning out of the function, mxGetNumberOfDimensions() already returns 2 instead of 3. Would be interesting to see what happens from an “abused” .mat file.

      > Fortunately for you, Matlab doesn’t support circular data structures. One less thing to worry about.

      It does, actually– I just didn’t bother even trying to mess with it :). Handle class instance variables are effectively references/pointers, but I skipped user-defined classes altogether.

      • Here’s one of these crafted .mat files. It has a variable with an illegal name and a struct with illegal field names, but Matlab loads it just fine. You’ll be able to read from but not write to these fields, and accessing the variable will require some cleverness.

        http://skeeto.s3.amazonaws.com/share/weirdo.mat

        Prior to 2014 or so, Matlab used signed integers internally when operating on metadata but would validate incoming .mat metadata values as unsigned. It was pretty easy to create .mat files that would immediately crash Matlab on load. I did it a lot by accident, so I was happy to have it fixed.

        The variable “strange_dims” is defined as 5 dimensions (2x1x1x1x1) but Matlab squeezes it, either as part of loading or maybe on first use. I can’t trick it into leaving trailing singletons.

        As for squeezing, consider the way a matrix is structured in a .mat file. The matrix is specified as the number of dimensions N, an array of N integers, and the data itself stored in a big flat buffer. Squeezing trailing singletons is a simple matter of decrementing N until they go away. It’s so trivial that I think Matlab does it on basically every access. Perhaps it’s a historical artifact.

  3. Very interesting! I can’t seem to get quite as far (just tested on Windows R2016a); the return-value form of load(), i.e. “s = load('weirdo.mat')”, fails with an “Error using load, Invalid field name: ‘*-trickster-*’.” The nargout=0 form *does* execute without any errors, but only strange_dims is visible in the workspace.

    (I thought I might have better luck with h=matfile('weirdo.mat'); whos(h), to try to “inspect” the file without actually trying to load it. No luck, got a more verbose error “‘*-trickster-*’ is not a valid dynamic property name blahblahblah.”)
