To conclude the ongoing post series on data serialization and organization, I would like to introduce a proposal for a universal data serialization format. If you have studied the subject a bit, a knee-jerk reaction may be “Why? Don’t we already have JSON, CSV, XML YAML HDF5, and countless programming language-specific options?”. However, the huge amount of formats we have suggests to me that the data serialization format problem has not be fully solved. And I hope I can convince you that I have a sound proposal to solve it in a very general way, without going for extreme complexity at the same time.
Why existing solutions are not enough
Plain text is not a file format in and of itself, as it is insufficiently well-specified for practical purposes. But many existing data serialization formats, including most ad-hoc application file formats, base themselves on plain text. The rationale for it is typically that plain text is human-readable and may be parsed by most programming languages with reasonable ease.
But when it comes to it, plain text files are not intrinsically human readable. The only reason us humans may read them is that text editors who may parse their bytes and display them on a computer screen are widespread. So saying that plain text files are as readable because everyone has access to the tools needed to read their content, is exactly as logical as saying that Windows is usable because everyone has learned how to use it.
The intrinsic readability advantages of plain text also tend to break down as soon as one tries to express complex structured data using it. The most egregious example of this are of course XML and LaTeX, but even less wasteful formats like JSON and YAML turn out pretty good at losing their reader in a sea of indentation. In practice, these require the use of specialized editors to fold unused indented blocks and thusly retain reader sanity.
Since plain text formats attempt to express the natural patterns of humans storing data in a text editor, they also tend to be as ill-specified as human behavior itself. CSV and Markdown have a million variants for exactly this reason.
And then there are encoding problems, escaping problems caused by the simultaneous of text as a container and as a content, and perhaps most important, performance problems.
Text is an intrinsically inefficient way to store numerical data, which proves to be a problem as soon as one attempts to store very large amounts of data using it:
- By using bytes which can store values from 0 to 255 in order to store single digits from 0 to 9, one effectively stores data 2.5 times less efficiently than what is theoretically possible.
- By storing floating-point data in base 10 instead of base 2, one ends up with long trails of aproximate decimals, typically requiring as much as 24 characters (plus a separator) to accurately store an IEEE double-precision number that fits in 8 bytes of computer memory.
- To parse textual numerical data during file opening and saving is also dramatically slower than to load an unsigned integer or floating-point number from RAM to a CPU register.
- Binary data lends itself well to regular packing, allowing the use of fast bulk data I/O primitives like the C fwrite() and fread() function. OTOH, textual data’s freeform nature usually requires much slower item-wise sequential parsing.
To get you an idea of how relevant this is in practice, I wrote trivial C programs to store and retrieve 10 million doubles in binary and textual form. The binary form takes 80 MB, is written in 0.175s and read in 0.110s. The textual form takes 250 MB (3.1x more), and is written in 4.870s and read in 3.040s (28x slower).
All of these programs were run without system tweaks, thus benefitting from a hot disk cache, which essentially eliminates file I/O bottlenecks in this simple test. In a cold cache scenario, the processing time difference becomes less relevant in the face of load time differences, to the benefit of text file formats. Assuming writes to an SSD running at 100 MB/s, I can estimate that binary writing would jump to 0.975s, binary reading to 0.910s, text writing to 7.37s and text reading to 5.54s, bringing the text slowdown closer to the 6-8x region (which would still be too much for performance-sensitive applications).
This is the core reason why text file formats are not, and will never be, enough for large datasets like pictures and videos, instead being replaced by domain-specific containers that constantly reinvent each other like MPEG containers and the HDF family. I believe that this is a sad state of affair, and a convincing rationale for the design of a true general-purpose binary file format, giving the binary word access to the level of generic greatness that JSON has enabled in the text world.
Proposed design goals
One lesson learned from JSON and YAML is that the combination of nested structured data and lists is a very powerful paradigm to express arbitrarily complex data. This is not so surprising, considering that this is what most programming languages base their data model on, but it should serve as input to any attempt at designing a universal file format, textual or binary.
Another philosophical position, which is widespread in the open-source community and which I find myself agreeing with, is that where space and loading speed is not a critical issue, file formats should strive to be self-documenting. Data identifiers should be Unicode strings, and it should be possible to comment tricky parts of data files. This makes it envisionable to recover content from data files whose source program has disappeared and whose documentation has been lost, avoiding a “VHS effect” where old data becomes totally unreadable after a certain amount of time has elapsed.
I believe that as an important part of self-documentation, and a powerful debugging tools, data files should be type-safe. It should be possible to identify data of primitive types easily, and to discriminate between user-defined data types. And a compliant parser should also be able to throw exceptions when software attempts to interpret data in an incorrect way.
Supported data types
To make high-performance data manipulation accessible, the format should have support for the de facto standards for binary numerical data, that have widespread native hardware support. At this point in time, this would mean unsigned integer words from 8 to 256 bits, signed integer in two’s complement representation, and binary floating point numbers in IEEE 754 representation. This list should not be set in stone, but rather open to future hardware evolution.
Beyond standard numerical data, common programmatic data structures should also be supported. I propose that the format offer native support for UTF-8 strings, lists, and structured data. 1- and N-dimensional arrays of identically typed data should also be supported, as they cannot be efficiently expressed in terms of inhomogeneous lists (due to the need to store redundant type and length information).
To allow for the expression of all programmatic data structures, there also needs to be a reference type, allowing a file record to refer to data stored in another file record. There should be an invalid “null” reference, preferably represented by the numerical value 0. Parsers should obviously mind basic pointer gotchas when manipulating references, such as not attempting to follow null references, references pointing outside of the file boundaries, or infinite reference cycles.
To compete with special purpose binary formats, one should strive to eliminate any kind of unnecessary redundancy from data files. Any format-bound information should ideally only be specified once within the file, or, in algorithmic terms, file format overhead should scale linearly in the amount of records. This entails that to address the common case of storing arrays of structured data, one should be able to define custom structured data types and only specify, upon instantiation, the information about them that changes from one file record to another.
But although a type system is necessary in order to support type checking and to eliminate unnecessary data redundancy, a full type system comparable to that of strongly typed programming languages is too complex to be expressed in a generic spec. For example, supporting Ada-like restricted integer ranges falls outside of the scope of this proposal.
Similarly, any kind of data structure that does not have native hardware support and is not in common use, and that can be stored using the framework of the previously defined data types (such as trees, program bytecode, or packed data structures) in a manner that is straightforward, efficient, and type-safe, needs no native support from this file format.
As with any binary format, endianness matters must be handled appropriately. The format should just feature a non-palindrome header allowing a parser to detect when an endianness conversion is necessary upon file load. Files should have no intrinsic endianness, so that endianness conversions be only necessary when data is transferred between two hardware architectures of incompatible endianness.
When used in native endianness conditions, the file format should be very efficient to parse, both in sequential and random ways. The later implies support for fast traversal and data indexing. Creating a file and appending data to an existing file should also be very efficient. However, data modification needs not to be guaranteed fast.
Designing a universal file format is hard, so mistakes will be made, and changes will be needed. Thus, the file format needs to be versioned. However, in the face of change, one should still strive to keep this format as backwards and forward compatible as possible.
Finally, although sequential file access with optional seeking has been the dominant paradigm for a while, memory-mapped files are a very interesting alternative approach, that has only received little attention because of the difficulty of mapping in-memory data structure onto serialized files. It think it could thus be potentially interesting to allow optional support for file record alignement on a word or cache line boundary, so as to make it possible to directly manipulate memory-mapped data in a program without going through copying and translation phases, in selected scenarios.
Preliminary specification work
First, please not that this is not a complete file format specification. For example, integer sizes will not be set. This is just an attempt to propose a tentative grammar for the universal file format outlined above, and see if it would make any sense.
The proposed structure is very classic: a file header, specifying some global file properties, followed by an arbitrary amount of file records, containing data, comments, and data type definitions.
With the exception of primitive data types that are part of the format specification, data types must be declared before they are used. However, they may be declared anywhere within the file, including after a number of data fields. This property allows for fast file appends, even when new data types are being defined, but should not be a recommended practice as it hampers file readability.
Data types are identified by a numerical identifier, which should be at least 16-bit and preferably 32-bit if the incurred space overhead proves negligible. A range of identifier values, for example 0-255, is reserved for primitive data types, while the remaining identifiers are free for custom use. Identifier 0 is reserved as an invalid type identifier.
File header : formatID version header_size alignment
The file header should contain the following information, in this order :
- A non-palindrome integer word, identifying the file format and specifying its endianness.
- A version number, specifying the version of the file format that is being used.
- An integer indicating how many bytes remain within the header, allowing for legacy parsers to skip the parts of newer headers that they do not understand.
- Information on the boundaries according to which file records and their contents will be aligned.
The last part is not specified at this point, because it requires some research which I have not done yet on how programming language implementation lay out their data in memory. This is necessary in order to determine the precise amount of alignment constraints that need to be provided, for the file format to be able to match the in-memory layout of typical program data structures.
General file record structure : string [extra_information]
In the current version of the specification, file records may come in three kinds : data fields, which contain named data, comments, which feature self-documentation not handled by the parser, and data type definition.
All kinds of file records start with an UTF-8 string, in NFD normalization, stored as an integer specifying string size in bytes, followed by the actual string data. The contents of that string vary from one record type to another, and will be defined in the individual record type definitions below.
Some record types contain nothing besides that string, whereas others need extra information that is simply concatenated after the identifier, to the extent allowed by the applicable alignment constraints.
Data fields : identifier typeID [parameter […]] data
Data fields start with an identifier string, which may include any Unicode character except for the reserved “.” character. Identifier naming is left up to software developers, but identifiers must be kept unique to a specific data member. Parsers do not need to check for identifier unicity when loading a named data member, and are allowed to pick the first data element in the file that matches a requested identifier instead. Parsers may set further restrictions on the identifier character set that they can parse, but are advised not to do so.
After the identifer comes the type identifier for the incoming data, along with a number of optional type parameters if a specific type requires additional information. Parsers are expected to check this type identifier against any program expectation, and report an mismatches as error by the standard means of the host programming language. Alternatively, in dynamic programming languages, they may use this identifier to set the type of data in memory when dynamically building data structures from a file.
Proposed primitive data types are defined as follow:
- Unsigned integer words from 8 to 256 bits, associated two’s complement signed integers, and IEEE 754 binary floating point numbers (single, double, and quad). These types require no parameters, their identifier is followed by the corresponding scalar data.
- Example : “<0x08>myNumber<[byte typeID]><0xFF>”
- References require no additional parameter. Their identifier is followed by a 64-bit offset, within the file, from the reference’s location to the data being referred to. The reference points to the typeID of the target data field, allowing for reference chaining, and removing the need for reference typing. Reference 0 is invalid, and may be used to represent null pointers.
- Example : “<0x0B>myReference<[ref typeID]><0x00 00 00 00 00 00 01 52>”
- Strings are defined in exactly the same way as file record identifiers : size in bytes, followed by the string data in NFD-normalized UTF-8 encoding. Strings do not need to be zero terminated, parsers who need to output zero-terminated strings must save up one extra character and manually add zeros as needed.
- Example : “<0x08>myString<[string typeID]><0x0D>Hello world !”
- Lists take two parameters : the size in bytes of the upcoming data, followed by the length of the list in items. The former is necessary in order to enable fast file traversal, the later for proper parser operation. Data is specified as a series of (typeID [parameter […]] data) structures, identical to that of a data field declaration but without the identifier part.
- Example : “<0x06>myList<[list typeID]><[remaining record size in byte]><0x02><[byte typeID]><0xFF><[string typeID]><0x0D>Hello world !”
- 1-dimensional array declarations also feature record size, followed by a type instantiation, that is, a typeID and the parameters that apply to that specific type, as in data fields. The type instantiation is followed by the length of the array. Finally, the array data is provided, without redundant type information.
- Example : “<0x07>myArray<[array typeID]><[remaining record size in byte]><[byte typeID]><0x04><0xDE><0xAD><0xBE><0xEF>”
- N-dimensional array declarations begin with record size and a type instantiation, as in the case of 1-dimensional arrays. The length part of 1D arrays, however, is replaced by the dimension of the array, followed by its length across each dimension, starting from rows. Finally, array data is provided, in row-major order.
- “<0x09>my2DArray<[ND array typeID]><[remaining record size in byte]><[byte typeID]><0x02><0x04><0x03><0xDE><0xAD><0xBE><0xEF><0xBA><0xDB><0x00><0x5E><0xDE><0xAD><0xC0><0xDE>”
- Structures are defined similarly to lists, with both size and length information. Their defining characteristic is that unlike with lists, their data is provided as full file records, and may include comments. This allows for the definition of hierarchical data structures.
- Example : “<0x08>myStruct<[struct typeID]><[remaining record size in byte]><0x02><0x09>innerByte<[byte typeID]><0xFF><0x08>innerStr<[string typeID]><0x0D>Hello world !”
Comments : comment_string
The string part of the comment is defined to begin by the dot (“.”) character, followed by a space. Upon encountering these two characters, a compliant parser should ignore anything that follows. Comments may thus contain any Unicode character, including stars, dots, Chinese ideograms, tabulations, and line feeds, without any clumsy escaping necessary.
Example : “<0x3E>. This is a comment and will not be interpreted by the parser”
Comments have no content behind their comment string.
Type definitions : typedef_string newID oldID [(parameter | invalid) […]]
Type definitions begin with a special string beginning with a dot, identifying the following content as a type definition. For example, that identifier could be “.typedef”, “.def” or “.type”.
Type definitions serve two roles:
- To allow for type safety, by letting a file format define aliases to the primitive data types (or to custom data types) that are incompatible with their ancestor and with one another.
- To remove unnecessary redundancy in stored data, for example allowing for a concise representation of arrays of structures.
The general scheme proposed to address the later scenario, is to provide a partial instantiation of the former type. Basically, one proceeds to define the ancestor type, as in a data field, by specifying its ID followed by any type parameter that apply. However, unlike in data field, it is permitted to use invalid parameter values in type parameters. The parser interprets such a practice as not setting the value of that type parameter, and will thus expect it to be set upon type use in data fields.
Effectively, invalid parameter values end up being used like %-expressions in printf() format strings, warning the parser that this data type requires additional information that will be specified upon type instantiation.
Let’s clarify invalid values for all primitive data type parameters :
- Strings, list, array and structure of length 0
- Sizes in bytes of 0 (actually, it should be made illegal to specify a size as part of a type definition, since this is more likely to be a mistake than an actual operation)
- Array dimension of 0
- Invalid typeID (typeID 0)
To illustrate this scheme, let’s assume we would want to define a custom string type, with type identifier 32768, that will always have a length of 13 characters. Its type definition would go as follow :
“<0x08>.typedef<0x80 00><[string typeID]><0x0D>”
One could then define a string of that type without specifying its length :
“<0x0F>myBoundedString<0x80 00>Hello world !”
Obviously, in this case, there are no real size benefits, because the newly created type is used only once. Defining types is only useful in order to enforce type safety, or to remove significant amounts of redundancy, as in the case where large amounts of identical structures are stored within a file.
Conversely, if one wanted to define a custom string of unspecified length, one would do it like this :
“<0x08>.typedef<0x80 01><[string typeID]><0x00>”
Instanciating this string would require specifying the missing length parameter, as follows :
“<0x0E>myCustomString<0x80 01><0x0D>Hello world !”
A very important special case is that of structures, for which complete field definition is to be provided as part of the type definition, as if one were declaring a structure data field, but without the actual data :
“<0x08>.typedef<0x80 02><[struct typeID]><0x00><0x02><0x09>innerByte<[byte typeID]><0x08>innerStr<[string typeID]><0x00>”
Instantiation of the custom structure would then only need to specify the parameters not set as part of the type definition, that is, size in bytes, container/string length… and the data itself.
“<0x0E>myCustomStruct<0x80 02><[remaining record size in byte]><0xFF><0x0D>Hello world !”
I believe that there is merit in this proposal. It paves the road for self-documenting file formats for a wide range of applications, all following the same underlying specification, and thus using the same parsing code.
- Its grammar is reasonably simple, yet powerful enough to address a wide range of needs.
- It is very compact for all commonly used data structures, and natively supports machine scalar types and Unicode strings.
- It handles endianness and alignement concerns, potentially allowing for direct use of memory-mapped files as program data structures for high-integrity applications.
- It supports all typical data containers, including array for compact data storage.
- Redundancy is avoided, making the format suitable for high-performance application like media content storage or massive data processing
- Files are fast to parse and traverse, appending is supported, only in-situ modification can be slow (though it isn’t for e.g. modifying array contents without adjusting their length)
Am I missing something important that a universal serialized format standard should provide ?