Consider the following problem: given a value in the MATLAB programming language, can we serialize it into a sequence of bytes (suitable for, say, storage on disk) in a form that allows easy recovery of the *exact* original value?

Although I will eventually try to provide an actual solution, the primary motivation for this post is simply to point out some quirks and warts of the MATLAB language that make this problem surprisingly difficult to solve.

**“Binary” serialization**

Our problem requires a bit of clarification, since there are at least a couple of different reasonable use cases. First, if we can work with a stream of *arbitrary* opaque bytes (for example, if we want to send and receive MATLAB data on a TCP socket connection), then there is actually a very simple and robust built-in solution… as long as we’re comfortable with undocumented functionality. The function `b=getByteStreamFromArray(v)` converts a value to a `uint8` array of bytes, and `v=getArrayFromByteStream(b)` converts back. This works on pretty much all types of data I can think of to test, even Java- and user-defined class instances.
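As a minimal sketch of that round trip, the following converts a small struct to bytes and back. The test value is made up for illustration, and since both functions are undocumented, their behavior could differ between releases.

```matlab
% Hypothetical test value: a struct mixing several types.
v = struct('name', 'pi', 'value', single(pi), 'flags', [true false]);

b = getByteStreamFromArray(v);   % undocumented: value -> uint8 bytes
w = getArrayFromByteStream(b);   % undocumented: bytes -> value

assert(isa(b, 'uint8'));         % the stream is an array of bytes
assert(isequaln(v, w));          % same values after the round trip
assert(isa(w.value, 'single'));  % the field's type survives as well
```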

**Text serialization**

But what if we would like something human-readable (and thus potentially human-editable)? That is, we would like a function similar to Python’s `repr`, that converts a value to a `char` string representation, so that `eval(repr(v))` “equals” `v`. (I put “equals” in quotes because even *testing* such a function is hard to do in MATLAB. I suppose the built-in function `isequaln` is the closest approximation to what we’re looking for, but it ignores type information, so that `isequaln(int8(5), single(5))` is true, for example.)
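One way to tighten the comparison a little is to check the class as well. This is only a sketch: `isidentical` is a hypothetical helper name, and it checks the class of the top-level value only, so type differences nested inside cell or struct contents would still slip through.

```matlab
% Hypothetical helper: same class at the top level, and equal values
% (with NaN treated as equal to NaN).
isidentical = @(a, b) strcmp(class(a), class(b)) && isequaln(a, b);

assert(isequaln(int8(5), single(5)));       % isequaln ignores the type
assert(~isidentical(int8(5), single(5)));   % the helper does not
assert(isidentical([1 NaN], [1 NaN]));      % NaNs still compare equal
```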

Without further ado, following is my attempt at such an implementation, to use as you wish:

    function s = repr(v)
    %REPR Return string representation of value such that eval(repr(v)) == v.
    %
    %   Class instances, NaN payloads, and function handle closures are not
    %   supported.

        if isstruct(v)
            s = sprintf('cell2struct(%s, %s)', ...
                repr(struct2cell(v)), repr(fieldnames(v)));
        elseif isempty(v)
            sz = size(v);
            if isequal(sz, [0, 0])
                if isa(v, 'double')
                    s = '[]';
                elseif ischar(v)
                    s = '''''';
                elseif iscell(v)
                    s = '{}';
                else
                    s = sprintf('%s([])', class(v));
                end
            elseif isa(v, 'double')
                s = sprintf('zeros(%s)', mat2str(sz, 17));
            elseif iscell(v)
                s = sprintf('cell(%s)', mat2str(sz, 17));
            else
                s = sprintf('%s(zeros(%s))', class(v), mat2str(sz, 17));
            end
        elseif ~ismatrix(v)
            nd = ndims(v);
            s = sprintf('cat(%d, %s)', nd, strjoin(cellfun(@repr, ...
                squeeze(num2cell(v, 1:(nd - 1))).', ...
                'UniformOutput', false), ', '));
        elseif isnumeric(v)
            if ~isreal(v)
                s = sprintf('complex(%s, %s)', repr(real(v)), repr(imag(v)));
            elseif isa(v, 'double')
                s = strrep(repr_matrix(@arrayfun, ...
                    @(x) regexprep(char(java.lang.Double.toString(x)), ...
                    '\.0$', ''), v, '[%s]', '%s'), 'inity', '');
            elseif isfloat(v)
                s = strrep(repr_matrix(@arrayfun, ...
                    @(x) regexprep(char(java.lang.Float.toString(x)), ...
                    '\.0$', ''), v, '[%s]', 'single(%s)'), 'inity', '');
            elseif isa(v, 'uint64') || isa(v, 'int64')
                t = class(v);
                s = repr_matrix(@arrayfun, ...
                    @(x) sprintf('%s(%s)', t, int2str(x)), v, '[%s]', '%s');
            else
                s = mat2str(v, 'class');
            end
        elseif islogical(v) || ischar(v)
            s = mat2str(v);
        elseif iscell(v)
            s = repr_matrix(@cellfun, @repr, v, '%s', '{%s}');
        elseif isa(v, 'function_handle')
            s = sprintf('str2func(''%s'')', func2str(v));
        else
            error('Unsupported type.');
        end
    end

    function s = repr_matrix(map, repr_scalar, v, format_matrix, format_class)
        s = strjoin(cellfun(@(row) strjoin(row, ', '), ...
            num2cell(map(repr_scalar, v, 'UniformOutput', false), 2).', ...
            'UniformOutput', false), '; ');
        if ~isscalar(v)
            s = sprintf(format_matrix, s);
        end
        s = sprintf(format_class, s);
    end

That felt like a lot of work… and that’s only supporting the “plain old data” types: struct and cell arrays, function handles, logical and character arrays, and the various floating-point and integer numeric types. As the help indicates, Java and `classdef` instances are not supported. A couple of other cases are only imperfectly handled as well, as we’ll see shortly.

**Struct arrays**

The code starts with struct arrays. The tricky issue here is that struct arrays can not only be “empty” in the usual sense of having zero elements, but also, independently of whether they are empty, they can have *no fields*. It turns out that the `struct` constructor, which would work fine for “normal” structures with one or more fields, has limited expressive power when it comes to field-less struct arrays: unless the size is 1×1 or 0×0, some additional concatenation or reshaping is required. Fortunately, `cell2struct` handles all of these cases directly.
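To make the field-less case concrete, here is a small sketch that builds a 2×3 struct array with no fields and rebuilds it through `cell2struct`. Using `repmat` to construct the test value and passing the dimension argument explicitly are my choices for the illustration, not part of the function above.

```matlab
s = repmat(struct(), 2, 3);   % 2x3 struct array with no fields

c = struct2cell(s);           % 0x2x3 cell array (zero fields along dim 1)
f = fieldnames(s);            % empty cell array of field names

t = cell2struct(c, f, 1);     % rebuild, taking fields from dimension 1

assert(isequal(size(t), [2 3]));   % shape is preserved
assert(isempty(fieldnames(t)));    % and there are still no fields
```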

**Multi-dimensional arrays**

Next, after handling the tedious cases of *empty* arrays of various types, the `~ismatrix(v)` test handles multi-dimensional arrays, that is, arrays with more than 2 dimensions. I could have handled this with `reshape` instead, but I think this recursive concatenation approach does a better job of preserving the “visual shape” of the data.
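As a small illustration of the two styles, the sketch below builds the same 2×2×2 array both ways, and shows the slicing step that splits an array into pages along its last dimension. The variable names are just for the example.

```matlab
v = reshape(1:8, [2 2 2]);             % reshape style: flat data plus size
w = cat(3, [1 3; 2 4], [5 7; 6 8]);    % cat style: each 2x2 page is visible

assert(isequal(v, w));

% Split into pages along the last dimension, as the ~ismatrix branch does.
pages = squeeze(num2cell(v, 1:2));     % 2x1 cell, one 2x2 page per element
assert(isequal(pages{1}, [1 3; 2 4]));
assert(isequal(pages{2}, [5 7; 6 8]));
```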

In the process of testing this, I learned something interesting about multi-dimensional arrays: they can’t have trailing singleton dimensions! That is, there are 1×1 arrays, and 2×1 arrays, even 1×2×3 and 2×1×3 arrays… but no matter how hard I try, I cannot construct an *m*×*n*×1 array, or an *m*×*n*×*k*×1 array, etc. MATLAB seems to always “squeeze” trailing singleton dimensions automagically.
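A quick sketch of what that looks like in practice (the array sizes are arbitrary):

```matlab
a = zeros(2, 3, 1);              % ask for a trailing singleton dimension
assert(ndims(a) == 2);           % ...but only two dimensions are reported
assert(isequal(size(a), [2 3]));

b = reshape(1:6, [2 3 1 1]);     % same result with reshape
assert(isequal(size(b), [2 3]));

c = zeros(2, 1, 3);              % an interior singleton is kept
assert(ndims(c) == 3);

assert(size(a, 3) == 1);         % any dimension past the last reports size 1
```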

**Numbers**

The `isnumeric(v)` section is what makes this problem almost comically complicated. There are 10 different numeric types in MATLAB: double and single precision floating point, and signed and unsigned 8-, 16-, 32-, and 64-bit integers. Serializing arrays of these types *should* be the job of the built-in function `mat2str`, which we do lean on here, but only for the shorter integer types, since it fails in several ways for the other numeric types.

First, the nit-picky stuff: I should emphasize that my goal is “round-trip” reproducibility; that is, after converting to string and back, we want the underlying bytes representing the numeric values to be unchanged. Precision is one issue: for some reason, MATLAB’s default seems to be 15 decimal digits, which isn’t enough, by *two*, to accurately reproduce all double precision values. Granted, this *is* an optional argument to `mat2str`, which effectively uses `sprintf('%.17g',x)` under its hood, but Java’s algorithm does a better job of limiting the number of digits that are actually needed for any given value.
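The classic `0.1 + 0.2` value shows the difference. This is a sketch using `sprintf` and `str2double` directly rather than `mat2str`, and the Java call assumes MATLAB is running with its JVM available.

```matlab
x = 0.1 + 0.2;                     % not exactly 0.3 in binary floating point

s15 = sprintf('%.15g', x);         % 15 significant digits
s17 = sprintf('%.17g', x);         % 17 significant digits

assert(str2double(s15) ~= x);      % 15 digits do not recover x
assert(str2double(s17) == x);      % 17 digits do

% Java picks the shortest string that still round-trips (needs the JVM).
sj = char(java.lang.Double.toString(x));
assert(str2double(sj) == x);
```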

Other reasons to bypass `mat2str` are that (1) for some reason it explicitly “erases” negative zero, and (2) it still doesn’t quite accurately handle complex numbers involving `NaN`, although it has improved in recent releases. Witness `eval(mat2str(complex(0, nan)))`, for example. (My implementation isn’t perfect here, either, though; there are multiple representations of `NaN`, but this function strips any payload.)
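Negative zero compares equal to zero, so the sign has to be observed indirectly. A minimal sketch of what a round trip needs to preserve (this checks the values only; it does not call `mat2str`, whose output may vary by release):

```matlab
z = -0;
assert(z == 0);                   % compares equal to positive zero...
assert(1 / z == -Inf);            % ...but the sign shows up on division

z2 = eval('-0');                  % the text form '-0' keeps the sign
assert(1 / z2 == -Inf);

c = complex(0, NaN);              % build the parts explicitly
assert(real(c) == 0);             % real part stays zero
assert(isnan(imag(c)));           % imaginary part is NaN
```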

But MATLAB’s behavior with 64-bit integer types is the most interesting of all, I think. Imagine things from the parser’s perspective: any numeric literal *defaults to double precision*, which, without a decimal point or fractional part, we can think of as “almost” an `int54`. There is no separate syntax for integer literals; construction of “literal” values of the *shorter* (8-, 16-, and 32-bit) integer types effectively *casts* from that double-precision literal to the corresponding integer type.

But for `uint64` and `int64`, this doesn’t work… and for a while (until around R2010a), it *really* didn’t work: there was no way to directly construct a 64-bit integer larger than 2^53, if it wasn’t a power of two!

This behavior has been improved somewhat since then, but at the expense of added complexity in the parser: the expression `[u]int64(`*expr*`)` is now a special case, as long as *expr* is an integer literal, with no arithmetic, imaginary part, etc. Even so much as a unary plus will cause a fall back to the usual cast-from-double. (It appears that Octave, at least as of version 4.0.3, has not yet worked this out.)
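Here is a sketch of the special case using 2^53 + 1, the smallest positive integer that a double cannot represent. It assumes a MATLAB release with the behavior described above; older releases and Octave may give a different answer.

```matlab
a = uint64(9007199254740993);    % bare literal: parsed exactly as 2^53 + 1
b = uint64(+9007199254740993);   % unary plus: rounded to a double first

% The double nearest to 2^53 + 1 is 2^53, so b is one less than a.
assert(b == uint64(9007199254740992));
assert(a - b == 1);
```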

The effect on this serialization function is that we have to wrap that explicit `uint64` or `int64` construction around each individual integer scalar, instead of a single cast of the entire array expression as we can do with all of the other numeric types.
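The difference between the two forms can be seen with a pair of values just above 2^53. Again this is a sketch that assumes the literal special case is available in the release at hand.

```matlab
% Wrap each scalar: every literal is parsed exactly.
each = [uint64(9007199254740993), uint64(9007199254740995)];

% Cast the whole array: the bracket expression is a double array first,
% so both values are rounded before the cast is applied.
whole = uint64([9007199254740993, 9007199254740995]);

assert(each(2) - each(1) == 2);    % exact values differ by 2
assert(~isequal(each, whole));     % the whole-array cast lost precision
```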

**Function handles**

Finally, function handles are also special. First, they *must* be scalar (i.e., 1×1), most likely due to the language syntax ambiguity between array indexing and function application. But function handles also can have workspace variables associated with them (usually when created anonymously), and although an existing function handle and its associated workspace can be *inspected*, there does not appear to be a way to *create* one from scratch in a single evaluatable expression.
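The following sketch shows both halves of that: a plain handle survives the `func2str`/`str2func` round trip, while a closure’s captured variable can be inspected with `functions` but is lost when the handle is rebuilt from its string. The variable names are made up for the example.

```matlab
f = str2func(func2str(@sin));     % named function: round trip works
assert(f(0) == 0);

k = 3;
g = @(x) x + k;                   % anonymous function capturing k
info = functions(g);
assert(info.workspace{1}.k == 3); % the captured value can be inspected

h = str2func(func2str(g));        % rebuilt from text: k is not captured
try
    h(1);
    lost = false;
catch
    lost = true;                  % calling h fails because k is undefined
end
assert(lost);
```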

Oh boy, this is an area I’ve also extensively researched myself!

> This behavior has been improved somewhat since then, but at the expense of added complexity in the parser: the expression [u]int64(expr) is now a special case, as long as expr is an integer literal, with no arithmetic, imaginary part, etc.

Would you believe I’m the one who, almost a year ago, started this conversation on the Octave bug tracker? 🙂 It’s still an active discussion, should you want to weigh in. It’s another great example of Matlab’s ad hoc language design.

http://savannah.gnu.org/bugs/?45945

The official documentation for the .mat format (level 5) is very revealing on the nature of Matlab’s types. I believe it’s essentially a (neat and tidy) memory dump, or at least it used to be at some point. For example, it wasn’t until I read this that I realized Matlab doesn’t actually have a scalar type. Everything, even what appears to be a scalar, is really a matrix/array of at least two dimensions, even if it’s 1×1, 0x0, or 0x100. This is related to the dimension squeezing you’re seeing.

As for closures, I honestly don’t think they’re worth trying to serialize. If they close over variables, you won’t be able to get to their values, especially because they’re not necessarily workspace variables. You also won’t know it actually closes over a variable unless you parse the function string and search for free variables in the AST, which is obviously very non-trivial.

Here are my thoughts on this matter in Emacs Lisp:

http://nullprogram.com/blog/2013/12/30/

And when I did it for JavaScript, I completely punted on serializing closures/functions. It just throws an error and quits:

http://nullprogram.com/blog/2013/03/28/

Fortunately for you, Matlab doesn’t support circular data structures. One less thing to worry about.

I forgot to add one more fun fact: Matlab doesn’t strictly enforce well-formedness when loading .mat files. You can (ab)use this by crafting your own .mat files that load oddly structured values that cannot otherwise be constructed from within the language itself. This *may* include matrices with trailing singleton dimensions, but I’d have to test it out.

Thanks for the link to the .mat documentation, I don’t think I had seen that before. Looking through it, I don’t see anything explicit (in section 1-17) about trailing singleton dimensions, so I’m not sure where the squeezing is happening. For example, I have tried doing this from a MEX function with

`mxCreateNumericArray(3, {2,2,1}, mxDOUBLE_CLASS, mxREAL)`

… but even before returning out of the function, `mxGetNumberOfDimensions()` already returns 2 instead of 3. Would be interesting to see what happens from an “abused” .mat file.

> Fortunately for you, Matlab doesn’t support circular data structures. One less thing to worry about.

It does, actually; I just didn’t bother even trying to mess with it :). Handle class instance variables are effectively references/pointers, but I skipped user-defined classes altogether.

Here’s one of these crafted .mat files. It has a variable with an illegal name and a struct with illegal field names, but Matlab loads it just fine. You’ll be able to read from but not write to these fields, and accessing the variable will require some cleverness.

http://skeeto.s3.amazonaws.com/share/weirdo.mat

Prior to 2014 or so, Matlab used signed integers internally when operating on metadata but would validate incoming .mat metadata values as unsigned. It was pretty easy to create .mat files that would immediately crash Matlab on load. I did it a lot by accident, so I was happy to have it fixed.

The variable “strange_dims” is defined as 5 dimensions (2x1x1x1x1) but Matlab squeezes it, either as part of loading or maybe on first use. I can’t trick it into leaving trailing singletons.

As for squeezing, consider the way a matrix is structured in a .mat file. The matrix is specified as the number of dimensions N, an array of N integers, and the data itself stored in a big flat buffer. Squeezing trailing singletons is a simple matter of decrementing N until they go away. It’s so trivial that I think Matlab does it on basically every access. Perhaps it’s a historical artifact.

Very interesting! I can’t seem to get quite as far (just tested on Windows R2016a); the return-value form of `load()`, i.e. `s = load('weirdo.mat')`, fails with an “Error using load, Invalid field name: ‘*-trickster-*’.” The nargout=0 form *does* execute without any errors, but only strange_dims is visible in the workspace.

(I thought I might have better luck with `h=matfile('weirdo.mat'); whos(h)`, to try to “inspect” the file without actually trying to load it. No luck, got a more verbose error “‘*-trickster-*’ is not a valid dynamic property name blahblahblah.”)

Interesting. I only tested it on R2015. They must have added more validation recently.
