Treating everything as plain text files

An early design decision of the UNIX operating system was that system resources should be abstracted as files, plain text files whenever possible, organized in a hierarchical filesystem. Such resources included pretty much everything shared between programs, or between programs and users, from user data and software configuration to hardware interfaces.

At the time, this common resource access metaphor was a groundbreaking innovation: competing operating systems usually featured one unique interface per kind of system resource. The latter approach was clearly inferior, both in usability (existing knowledge couldn’t be reused to face new problems) and in the much higher implementation complexity it implied.

However, UNIX was designed in the 70s, and the computing landscape has changed dramatically since then. How did this design decision stand the test of time? This post will attempt to discuss its pros and cons in modern operating systems, and ways it could be improved upon.

Why plain text files?

First, let us ask why plain text files were chosen as a universal OS interface in UNIX. After all, binary data can efficiently express any other kind of data, including plain text, whereas plain text can only efficiently express human-readable text and is a memory- and processing-inefficient medium for any other kind of data, including numbers, the one thing computers manipulate most all day long.

To answer this question, one has to understand that in the days when UNIX was designed, it was common for users to parse and edit computer data by hand, with no assistance beyond very primitive plain text and hexadecimal editors. At that point in computing history, having software operate on human-readable data was a strong selling point, as it saved people the trouble of firing up a hexadecimal editor and deciphering a machine-friendly encoding just to do some manual editing.

Moreover, processing plain text was (and still is) a very important task for computers. Ever since everyone agreed that punch cards and hand-written processor mnemonics were a bad idea, computer programs have been written in plain text. And since folder hierarchies were (and still are) the universally agreed-upon standard for data organization, UNIX had to support those too. The functionality had to be there anyway, so it made sense for UNIX to reuse those abstractions for other purposes, instead of inventing new metaphors for every other kind of shared system resource.

Now, let’s fast-forward to the present day and see how the lessons learned since then, and modern constraints, match up against this design philosophy.

Plain text as a data storage format

Since the days of Research UNIX, computing has grown to a scale that was previously undreamed of. Computers are in every home and every pocket, connected to each other via ridiculously fast interconnects. The world wide web is the largest repository of human knowledge ever built, by several orders of magnitude, and large-scale computing facilities wield number-crunching power that would have given old Charles Babbage a heart attack.

To build all this computing infrastructure and keep it working, programmers had to learn a great deal about data serialization.

  • They learned that plain text handling is only simple if it solely involves (a subset of) English text. The need for multilingual support and international data interchange ultimately led to the emergence of a huge beast called Unicode, which is a truly universal medium for plain text, but takes the “simple” out of text processing.
  • As even text had to go multi-byte (the early Unicode encodings, unlike UTF-8, were not byte-oriented), people realized that endianness was not as big a deal as it was made out to be: just put an unsigned integer that isn’t a binary palindrome at the beginning of your document and be done with it (a small sketch of this trick follows this list).
  • The emergence of multimedia content meant that plain text stopped being a reasonable universal medium for data storage. Expressing one number, or a few thousand, in a way that’s inefficient to both store and process may be good enough, but expressing 3*1920*1080*24*3600 numbers (one hour of uncompressed 1080p video at 24 fps, i.e. over 500 billion 8-bit samples) certainly isn’t.
  • Since humans couldn’t realistically parse datasets this large by hand, data manipulation tools gradually became far more sophisticated and domain-specific as computing power grew, relegating hexadecimal and text editors to a few niche application areas.
  • In spite of ever-growing processing power, the processing inefficiency and scalability issues of simple plain text formats like CSV, XML and JSON also became a problem in several of their traditional application areas, including web development and scripting (where JIT and AOT compilation to binary data was introduced), database storage (where binary equivalents to text formats, like BSON, were designed), and scientific data analysis (where FITS and HDF5 started taking over the world in large facilities).
  • In the process of introducing binary data in some of these solutions, previous concerns that binary data containers couldn’t be both self-documenting and universal were shown to be untrue, through the continuing development of general-purpose binary containers like HDF5 and UBJSON.
  • As software grew in complexity, simple 2D arrays and lists of [key, value] pairs stopped being enough for our data storage needs, and the development of a myriad of incompatible hierarchical data storage formats made people realize that “plain text” isn’t a complete file format specification.
  • As computing scaled to a larger user base, we computer users realized that user interface designers could and should design more usable interface models than “go to this configuration file and change the value of this flag”. This called into question the cost of parsing configuration files saturated with long comments meant for human readability.
  • Object-oriented programming came around and imposed itself as the dominant programming paradigm for some applications, such as GUI development. Its principle of information hiding put the problem of storing perpetually evolving data structures in forward- and backward-compatible containers in the spotlight. In some respects, OOP worsened the data compatibility situation by making programmers think that the inner structure of the data they manipulate is irrelevant and may be freely changed.
  • At this point, this last data serialization problem has not been fully solved, and possibly never will be, though dynamic programming languages like Python and IDL-based object storage formats like Protocol Buffers are certainly trying their best.
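
To make the endianness trick mentioned above concrete, here is a minimal C sketch of reading such a magic number back. The 0x0A0B0C0D value and the data.bin file name are made-up examples rather than any real format, and error handling is kept to a bare minimum:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk magic number.  Because 0x0A0B0C0D is not a binary
 * palindrome, reading it back on a machine with the opposite byte order
 * yields a different, recognizable value. */
#define MAGIC_NATIVE  0x0A0B0C0DU
#define MAGIC_SWAPPED 0x0D0C0B0AU

/* Byte-swap a 32-bit value, used to fix up foreign-endian files. */
static uint32_t swap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00U)
         | ((x << 8) & 0x00FF0000U) | (x << 24);
}

int main(void) {
    FILE *f = fopen("data.bin", "rb");   /* file name is just an example */
    if (!f) return 1;

    uint32_t magic;
    if (fread(&magic, sizeof magic, 1, f) != 1) { fclose(f); return 1; }

    if (magic == MAGIC_NATIVE) {
        puts("file was written with this machine's byte order");
    } else if (magic == MAGIC_SWAPPED) {
        puts("file comes from the other endianness: swap every field");
        /* e.g. every subsequent uint32_t field goes through swap32() */
        (void)swap32(magic);
    } else {
        puts("not one of our files");
    }
    fclose(f);
    return 0;
}
```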

What should we remember from all these lessons of data serialization history? That data serialization is a very complex and fascinating computing problem, one that may have no general answer. And that if it has one, “just express everything as plain text” most certainly isn’t it.

Files as a universal OS interface

Beyond the relevance of plain text itself as a layer for data serialization, one has to wonder if files themselves are a reasonable universal abstraction for shared OS resources.

As mentioned above, one of UNIX’s top design innovations in the realm of resource management was to use the file metaphor not only for data itself, but also for other kinds of system resources, like random number generators (/dev/random), minimal data sinks for the pipe interface (/dev/null), and network sockets. At the time, it seemed like a good idea, as it simplified the OS design by reducing the number of abstractions that the OS had to support and that users had to learn.

However, I would argue that this unification was only achieved by dramatically complicating the file abstraction itself, compared to its “dumb piece of data” counterpart in other operating systems.

In UNIX, a file is not a piece of data. It is a virtual input port, or output port, or both. One that may be seek()-able, or not. One that may expose bounded contents (through EOF semantics) or unbounded ones (with no way to know in advance). One that has complex buffering semantics, a performance optimization inherited from the dark ages of computing before asynchronous I/O, which remains a frequent source of errors for programming newcomers (forgotten fflush(), anyone?).
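
As a small illustration of how those buffering semantics bite, here is a minimal C sketch; the _exit() call merely stands in for any crash or abnormal termination path:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* When stdout is redirected to a file or a pipe, it is fully buffered:
     * this text first lands in a buffer inside the C library, not in the
     * file itself. */
    printf("starting the risky step...");

    /* Without this flush, the _exit() below throws the buffered text away
     * and the message is never seen: the classic forgotten-fflush bug. */
    fflush(stdout);

    /* Stand-in for a crash or any abnormal termination path: _exit()
     * leaves the process immediately, without flushing stdio buffers. */
    _exit(1);
}
```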

The UNIX file metaphor is an extremely complex abstraction. Which makes it, by my standards, a poor abstraction. Good software abstractions solve a clear problem, are easy to explain, and conform to user expectations; UNIX files do none of these. They break end-users’ and programmers’ mental model of files as pieces of data, and then run over it with a steamroller. You may argue that this is for good reasons, and I beg to respectfully disagree. Even the name “file” has its roots in pieces of data, not in input and output ports.

There is data, and then there are streams of data. UNIX is designed around the philosophical position that the difference between the two is irrelevant, whereas I will argue that, given how computing has evolved, it makes a world of difference. Sequential streams are just one of many programming models for data access that programmers can use. Others include, for example, memory-mapped files and databases, which abstract the whole data loading and storing process away. Such abstractions allow data loading to be parallel rather than sequential, in the background rather than in the foreground, or on-demand rather than all-at-once.
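
As an example, here is a minimal POSIX sketch of the memory-mapped alternative: the file is presented as an array in memory and the kernel pages data in lazily, on demand, instead of the program pulling it through a sequential stream. The checksum loop is just a placeholder access pattern:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Map the whole file: from here on it behaves like a read-only array,
     * with no explicit read loop and no stream buffering to get wrong. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return 1; }

    /* Placeholder access pattern: touch every byte; pages are faulted in
     * by the kernel only as they are actually needed. */
    unsigned long checksum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        checksum += data[i];
    printf("%lu\n", checksum);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```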

Data streams make sense for computing problems which are actually well described in terms of them, such as network communications or port-based processor I/O. But other system resources are better expressed by other simple computing metaphors, and in those cases, shoving them into a stream box for the sake of design purity feels plain wrong to me. There is data, and then there is the way we access it, and if the two are separate in the real world, they should be kept separate in the virtual world that operating systems expose to us. Otherwise, the OS design fails at future-proofing, as one soon realizes when trying to apply traditional UNIX metaphors to modern parallel and distributed computing scenarios.

Finally, treating everything as a file leads to a proliferation of domain-specific filesystems, the wisdom of which is debatable. Just count the number of filesystems a modern Linux system requires in order to run: tmpfs, devtmpfs, sysfs, sockfs, pipefs…

Byte streams as a sequential I/O interface

Still, data streams make sense for some intrinsically sequential kinds of communication, such as audio I/O or TCP/IP packet exchanges. In these areas, they turn out to be the right software metaphor for the real-world problem they solve. We can debate how to handle scenarios involving multiple concurrent data streams, and stumble onto a wealth of unsolved research problems in the process, but at a fundamental level, the stream metaphor of “pushing data into a pipe and pulling it from the other end” is right.

The next question is: why did UNIX streams have to be composed of bytes?

In modern computing scenarios, single bytes are usually too short to represent any kind of meaningful data, with the possible exception of boolean truth values (for which they are too wide instead). Even text, the long-standing stronghold of single-byte information, had to give up and go multi-byte in Unicode, adding insult to injury by using a variable-length multi-byte encoding in its UTF-8 variant.

Hardware, too, has largely given up on handling single bytes. Processing them, as it turns out, is a lot less efficient than processing groups of bytes such as 8-byte integers, 32-byte SIMD vectors, 1920*1080*3-byte video frames, or other kinds of multi-kilobyte buffers. These larger chunks of data are handled much more efficiently by today’s massively parallel and superscalar hardware: a typical processor performs a binary operation on a single byte just as quickly as on a 32- or 64-bit word, which means that byte-wise processing is 4 to 8 times slower per bit of processed data.
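
A toy C sketch of that argument: both functions below compute the XOR of a buffer’s contents, but the second issues roughly one eighth as many operations by working on 64-bit words (it assumes, for brevity, that the buffer length is a multiple of eight bytes):

```c
#include <stddef.h>
#include <stdint.h>

/* One operation per byte processed. */
uint8_t xor_bytes(const uint8_t *buf, size_t len) {
    uint8_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc ^= buf[i];
    return acc;
}

/* One operation per 8 bytes processed, plus a small fold at the end. */
uint8_t xor_words(const uint64_t *buf, size_t nwords) {
    uint64_t acc = 0;
    for (size_t i = 0; i < nwords; i++)
        acc ^= buf[i];
    /* Fold the 8 byte lanes of the accumulator down to a single byte. */
    acc ^= acc >> 32;
    acc ^= acc >> 16;
    acc ^= acc >> 8;
    return (uint8_t)acc;
}
```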

An argument in favor of byte streams is that any kind of binary data can be decomposed into a stream of bytes. However, an important semantic difference remains between a stream of bytes and a stream of objects: pushing a serialized object through a byte stream isn’t an atomic operation. In between the serialized bytes of a given object, software may block, freeze, crash, or do all kinds of other nasty things, and the receiver at the other end of the byte stream has to handle every one of them gracefully, which adds a great deal of complexity to the development of communicating software in UNIX.
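
Here is a sketch of the boilerplate this pushes onto every receiver: a helper that keeps calling read() until a complete fixed-size message has been reassembled. The function name is mine, not a standard API, but some variant of it exists in virtually every program that speaks a protocol over UNIX byte streams:

```c
#include <errno.h>
#include <unistd.h>

/* Read exactly `len` bytes from `fd` into `buf`, coping with the fact that
 * a byte stream may deliver a message in arbitrary fragments.  Returns the
 * number of bytes read, or -1 on error or if the peer vanished mid-message. */
ssize_t read_full(int fd, void *buf, size_t len) {
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n == 0)                 /* EOF in the middle of a message */
            return -1;
        if (n < 0) {
            if (errno == EINTR)     /* interrupted by a signal: retry */
                continue;
            return -1;              /* genuine I/O error */
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```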

Of course, you may try to ignore the complexity by having a software library take care of it for you, and offloading the pain to the designers of that library. But I will argue that when thousands of developers end up writing or using libraries to handle a specific functionality, that functionality should be standardized into an operating system primitive. That operating systems should have standard support for the modern use case of atomically sending type-safe messages through a communication channel, in a fashion that guarantees that either the full message will make it through the channel, or the software on the other end won’t ever hear of what happened to the unspeakable horrors that attempted to get through.
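
Operating systems do already ship partial versions of this idea. POSIX message queues, sketched below, deliver each message whole or not at all; the /demo_queue name and the 64-byte message size are arbitrary example values, and on Linux this may need linking against -lrt with older C libraries:

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <sys/types.h>

int main(void) {
    struct mq_attr attr = {
        .mq_maxmsg  = 8,     /* how many messages the queue may hold */
        .mq_msgsize = 64,    /* upper bound on a single message, in bytes */
    };
    mqd_t q = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    /* The whole message is enqueued atomically... */
    const char msg[] = "hello";
    if (mq_send(q, msg, sizeof msg, 0) < 0) perror("mq_send");

    /* ...and dequeued atomically: the receiver never sees half a message. */
    char buf[64];
    ssize_t n = mq_receive(q, buf, sizeof buf, NULL);
    if (n >= 0) printf("received %zd bytes: %s\n", n, buf);

    mq_close(q);
    mq_unlink("/demo_queue");
    return 0;
}
```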

To me, the only real design question here is: do we provide native support for variable-length and variable-type messages or not? Variable message lengths and channel multiplexing through typing intrinsically add overhead to the communication channel, so software should have the option to disable them when it doesn’t need them. If all you will ever send through a message-passing channel are 32-bit integers, then the messages going through that channel, after the initial handshake, should be nothing but 32-bit integers: not a type code, followed by a length, followed by a 32-bit integer.

Why? Because the receiving end of the communication already knows this information, so it is unnecessary and redundant, and perpetually sending it pollutes the communication channel for nothing. Low-level communication primitives should aim for zero messaging overhead wherever they can afford it, if they don’t want to become a processing and transmission efficiency bottleneck in performance-sensitive applications.
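
To make the difference concrete, here is a hypothetical sketch of the two wire layouts; the field and struct names are mine, and struct padding is ignored for clarity:

```c
#include <stdint.h>

/* Self-describing channel: every message repeats a type tag and a length
 * that the receiving end already knows after the handshake. */
struct tagged_msg {
    uint8_t  type;     /* "this is a 32-bit integer"                   */
    uint32_t length;   /* payload size in bytes, always 4 here         */
    int32_t  payload;  /* 4 useful bytes out of 9 sent (padding aside) */
};

/* Channel whose message type was fixed at handshake time: pure payload. */
struct fixed_msg {
    int32_t payload;   /* 4 useful bytes out of 4 sent */
};
```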

Similarly, any serious scheme for message passing these days should offer some sort of signalling mechanism (for communication error handling and flow control, preferably in a communication channel separate from the message one, as multiplexing has a cost), asynchronous interfaces, and avenues for bulk transfer of messages. Lack of support for these simple performance and usability optimizations is a design oversight that every message-passing scheme will end up regretting sooner or later.

Message-passing channels are not new. Programming languages, shared libraries, and even a number of operating systems have had support for them for a while. I’d argue that it would be about time for them to become the de facto standard for operating system data streams, and for the notion of bytestreams to die already. It matches neither the reality of modern hardware, nor the reality of modern data, and is really just a shadow of the C/UNIX past that should vanish into computer antique oblivion.

Conclusions

While I admire the purity of the UNIX way, when it comes to treating almost every OS-managed system resource as a plain text file, I think that it has outlived its usefulness.

Plain text has proven itself inadequate as a One True standard for data serialization, both due to its incompleteness (you need additional structuring to make it useful), and due to its parsing and storage inefficiency. Especially so since Unicode has imposed itself as the only reasonable way to encode it.

Human readability of data has also stopped being as much of a major concern as it was in the 70s, and relying on it has become a software usability anti-pattern in most cases. Moreover, where self-documenting data is desired, binary formats have proven themselves suitable for the job as well as plain-text ones. Meanwhile, the evolution of software itself has also raised issues of backward and forward data compatibility that cannot be simply addressed by the simplest data serialization schemes, including your software-specific flavor of “plain text data”.

Simply put, “plain text”, both in itself and as a very loosely defined data serialization scheme, has stopped cutting it. If such a thing as a universal data serialization scheme can be built, it won’t be out of plain text.

Worse yet, in my view, is the UNIX notion of extending the file concept into a more general (and very convoluted) kind of data stream, thus implicitly equating the two. Files are pieces of data, while streams are pipes that pieces of data can be pushed through. Streams are only one way to access data, and not always the most efficient or relevant one.

Finally, OS resources which ARE best expressed as data streams should probably not be limited to byte-granular data exchange. Single-byte manipulation and transmission is an artifact of the past, and OSs should strive to impose atomic message-passing channels as their sole universal data stream primitive, instead of exposing single-byte exchange to users and requiring them to handle it. Message-passing channels can be made very flexible, but should still give users access to their most efficient mode of operation: transmitting arrays of data of known type and length.

And that’s all for my conclusions on the UNIX model of files. Stay tuned for next week, where I’ll further expand on organization of system resources through the hierarchical filesystem, and ways I think it could be improved upon so as to better fit modern use cases.
