As discussed last week, a major role of an operating system is to manage shared computing resources. These resources can be pieces of data, I/O streams, procedural interfaces, objects, or indeed any other kind of abstract computing resource.
I argued that the UNIX way of abstracting all of these as “files”, in the UNIX sense of a byte stream (preferably an ASCII text stream) with a varying assortment of strange properties, was a questionable design choice, and discussed other ways to handle both pieces of data and I/O streams. But I agreed that they are both system resources, and that exposing them through a common interface is thus a good idea.
This post series will now focus on the role that the filesystem plays in such a unified exposure of system resources on UNIX systems, its limitations, and ways to improve upon it. Due to the complexity of the subject, it will take me a couple of blog posts to reach a satisfactory conclusion on this matter, so please bear with me if the discussion turns out a bit lengthy.
Unified file system magic
One of the first lessons of any UNIX course is that whatever you are looking for, it is likely lying somewhere under /, much like web resources are extremely likely to lie somewhere behind an http:// or https://. The unified UNIX filesystem exposes a huge amount of system resources through a hierarchical filesystem, which can be dynamically extended through the mechanism of “mounting”, which plugs something that can be accessed like a folder hierarchy into another folder hierarchy.
Filesystem mounting tricks make it possible, on UNIX-inspired operating systems, to query the filesystem for anything ranging from local files, external media, network resources, keyboard input, and audio output, through temporary data that will be automatically deleted on system shutdown, to CPU capabilities and ACPI information. It’s all there. Though more recent UNIX-based operating systems have often sacrificed the purity of filesystem-based resource access for some resources, in the name of practicality, it remains one of the most extensive ways ever devised to discover and locate system resources without caring about which kind of resource exactly one is looking for.
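As a quick illustration (assuming a UNIX-like system, with the /proc part applying to Linux specifically), both the mount table and a kernel-provided pseudo-file can be inspected with the same ordinary file tools:

```shell
#!/bin/sh
# List the filesystems currently grafted into the single / hierarchy.
# `mount` with no arguments is read-only and needs no privileges.
mount | head -n 5

# Many "files" in that hierarchy are not stored data at all: on Linux,
# /proc entries are kernel interfaces that answer read() calls on the fly.
if [ -r /proc/cpuinfo ]; then
    grep -m 1 'model name' /proc/cpuinfo
fi
```

The same `cat`, `grep`, and `head` that work on a text document work on the mount table and on CPU information, which is precisely the uniformity being praised here.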
And so, UNIX fans will be quick to brag about the awesomeness of having a unified filesystem, and they sure can, because from a design point of view, the idea was great. Simultaneously, though, one cannot ignore how newcomers to UNIX operating systems, and even UNIX gurus at times, are often totally lost when they try to FIND something in the middle of that huge hierarchy.
From a usability point of view, one can conclude that although the unified UNIX filesystem is great for locating known resources in a uniform fashion, it comes with its fair share of discoverability issues when people try to find unknown or partially known resources in it. But why is this? Can we analyze this failure further, so as not to replicate it?
Issues with the UNIX filesystem hierarchy design
One of the most infamous usability problems with the UNIX filesystem hierarchy, on modern UNIX cousins, is that when you look for something, even if you’ve learned the hefty FHS documents by heart, there are still many places to look for it. For example, a program can be located…
- In /bin
- In /sbin
- In /usr/bin and /usr/sbin
- In /usr/local/bin and /usr/local/sbin
- In /usr/share/bin and /usr/share/sbin
- In a nonstandard subdirectory of /usr
- In a subdirectory of /opt, someplace unknown
- In a user’s home directory, someplace unknown
- In /etc, if it’s a script that is run on specific occasions
- Somewhere else entirely
Accounting for this wide array of possible program locations without breaking the UNIX philosophy of making installed programs accessible everywhere requires setting a gigantic number of possible paths in the PATH environment variable. This is typically done in a partially automated fashion that will break from time to time in circumstances unforeseen by the simple shell script performing the trick.
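A minimal sketch of how this plays out in practice: the shell consults the colon-separated PATH list in order, and `command -v` reveals which candidate directory wins for a given name:

```shell
#!/bin/sh
# PATH enumerates every directory the shell will search, in order,
# when resolving a bare command name. One line per candidate:
printf '%s\n' "$PATH" | tr ':' '\n'

# `command -v` reports which of those candidates actually wins:
command -v ls
```

Every one of the locations listed above that should be searchable must appear in that list, which is why login scripts end up assembling PATH from so many fragile pieces.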
Similar conundrums exist for libraries, configuration files, desktop icons, and in fact pretty much any other kind of system resource if you compare multiple flavors of UNIX. There are schisms between /mnt, /media, and /media/<username>, as there are between /proc, /sys, and /dev. The cryptic directory names used by UNIX, chosen to save characters in the dark ages of computing when shell autocompletion didn’t exist, are already a significant liability from a usability point of view, but the lack of clear agreement on which thing goes where significantly aggravates the problem.
Yet if you do spend a sufficiently large amount of time exploring UNIX literature, you will find many competing standards defining precisely that. What will be surprising about these, though, is that they will often be based on matters so inconsequential these days as to feel wholly arbitrary. For example, the split between bin and sbin makes little sense when super-users will need access to tools in bin to do their work and less privileged users will sometimes be privileged enough to run sbin stuff too. And the split between root-level bin directories and usr-level ones is based on ancient storage space considerations from the PDP-11 era that fortunately do not apply to 21st century computers anymore.
Filesystem hierarchy aging problems
From these examples, we start to realize that the core design of the UNIX hierarchy standard did not age well, much like any information hierarchy ever devised (ever heard of the split between “science” and “technology” in the Dewey classification scheme, which is used by most libraries in the Western world?). But since filesystem queries are also the way system resources are located on UNIX operating systems, fundamentally changing the UNIX filesystem hierarchy standard cannot be done without breaking a significant amount of existing software. This is why no one dares to do it. Even Fedora’s relatively minor change of moving root-level binaries to the /usr subdirectory generated a lot of heat in the Linux community.
Another way in which the UNIX filesystem hierarchy standard didn’t scale well is that although in some areas it specified too much, in others it did not specify enough. It is common for directories on UNIX systems to contain thousands of files, because no further hierarchization scheme exists to organize their content. At this level of crowding, one simply cannot pretend that the data is organized anymore: manually discovering the contents of the biggest UNIX directories is an exercise in masochism, and even looking for something you know is there, without remembering its full path, proves extremely difficult. Again, the UNIX unified filesystem is an indisputable success at providing a unified means to locate system resources of many kinds, but a failure at making said resources discoverable.
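The crowding is easy to see for yourself (entry counts will of course vary from one system to another) with a one-line loop over a few classic flat directories:

```shell
#!/bin/sh
# Count the entries in some classic "flat" UNIX directories. On a typical
# desktop system, /usr/bin alone holds thousands of sibling files with no
# further sub-structure to organize them.
for dir in /usr/bin /usr/lib /etc; do
    [ -d "$dir" ] && printf '%6s entries in %s\n' "$(ls "$dir" | wc -l)" "$dir"
done
```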
The UNIX filesystem hierarchy, in a sense, is kind of like a black hole: once something gets lost in there, it’s pretty much lost forever. But how could one find a way around this unfortunate fate?
Adding more hierarchy to fix the issue?
A naive solution to this problem would be to redesign the UNIX filesystem hierarchy in a way that is more relevant to our century, including a deeper level of hierarchization to account for the exponential growth of software and library use, and a fundamental rethinking of the way some things are sorted, so as to reduce redundancy and ambiguity.
However, this fails to take into account that no human brain is really capable of devising a way to classify all the information in the world unambiguously, which is in essence what such an endeavor would amount to. When we humans struggle to build consistent directory hierarchies for a couple hundred personal files, who would be foolish enough to extend such a clunky approach to millions of system resources?
Actually, I used to be a proponent of the universal filesystem hierarchy in the earlier days of this blog, but at some point I just gave up. There’s no way it can possibly work. If Dewey and the brightest minds of his century couldn’t do it, neither can you and I. It’s just hopeless.
A fundamental issue with hierarchies is also that they are not future-proof. Beyond technical considerations that stop making sense as one century gives way to the next, whenever a new element added to an information hierarchy highlights a limitation of the underlying classification, there is no choice but to reclassify existing items to resolve the ambiguity. Much like orthographic reforms that try to change natural human languages and rewrite whole books in the new style, such re-hierarchization is a Herculean task that will never be truly complete.
Also, hierarchies are fundamentally unable to accurately express overlapping properties of a given object. For example, given a musical track that has an author, a musical genre, an album, and a year of release, different people will build incompatible folder hierarchies to express the same thing simply by virtue of ordering the hierarchy differently. Some will do it artist-first, some genre-first, some album-first, and others year-first.
And finally, when the time comes to acknowledge that a file belongs to multiple regions of the hierarchy, because making non-overlapping information categories is all but impossible, users will get hurt by the sheer technical difficulty of putting a file in multiple places of a hierarchical filesystem. One can only be amazed and infuriated at how easily hard link semantics will end up confusing even the simplest program, but this only reflects how programmers aren’t used to making a distinction between a resource itself and one of its locations in the filesystem.
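That distinction is easy to demonstrate: a hard link gives a single inode a second, equal-ranking path, so neither name is “the” file. A small sketch in a temporary directory:

```shell
#!/bin/sh
# A hard link gives one file (one inode) a second path of equal standing.
dir=$(mktemp -d)
echo "hello" > "$dir/original"
ln "$dir/original" "$dir/alias"

# Both paths resolve to the same inode number...
ls -i "$dir/original" "$dir/alias"

# ...and the link count (second column of ls -l) is now 2 for each path:
ls -l "$dir/original"

rm -r "$dir"
```

Deleting one of the two names leaves the resource intact under the other, which is exactly the resource-versus-location distinction that trips up so many programs.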
Intermission: towards a unified hierarchical filesystem successor?
So, let’s sum up what we have established so far. The UNIX filesystem exposes a uniform way to locate a system resource, which is an awesome idea. And it does so through a partially reconfigurable and highly extensible hierarchy that, in theory, should also make such resources more discoverable.
However, the concept of a universal hierarchical classification turns out to be fundamentally bogus for a large number of items. This leads, on one side, to the emergence of many incompatible implementations of the hierarchy, breaking the “uniform locator” concept. And it leads, on the other side, to overcrowded folders and meaningless classifications, breaking the discoverability aspect.
In this sense, hierarchical filesystems are in dire need of a successor that separates the concern of querying individual objects (most efficiently and reliably handled through some kind of UUID-based inode numbering scheme) from that of iteratively looking up data through increasingly precise filtering (best handled through true support for non-hierarchical file metadata, and a dramatic improvement in the metadata-awareness of operating systems and application programs).
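To make the idea concrete, here is a deliberately naive sketch of such a scheme; the index format, the UUIDs, and the tag names are all invented for illustration, with a flat text file standing in for what a real implementation would store inside the filesystem itself:

```shell
#!/bin/sh
# Hypothetical sketch of metadata-first lookup: each resource is named by a
# stable UUID (first field), and free-form key=value pairs replace the path
# hierarchy. Everything below is made up for illustration.
index=$(mktemp)
cat > "$index" <<'EOF'
6f1c2a9e-0001  artist=Holst      genre=classical   year=1918
6f1c2a9e-0002  artist=Holst      genre=classical   year=1916
6f1c2a9e-0003  artist=Kraftwerk  genre=electronic  year=1978
EOF

# "Increasingly precise filtering": narrow by one property, then another,
# without ever committing to an artist-first or year-first hierarchy.
grep 'genre=classical' "$index" | grep 'year=1918'

rm "$index"
```

Note that the two filters commute: filtering by year first and genre second selects the same record, which is exactly what a fixed folder hierarchy cannot offer.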
But how would such a modern information retrieval scheme cope with legacy filesystem implementations, when files are shared by e-mail and on external drives? How would it avoid pollution in the metadata namespace itself, whose most straightforward implementation is hierarchical and vendor-specific? How could it handle the sensitive security questions that increased metadata use will raise? How could it efficiently store, retrieve, and parse metadata so as to avoid the fate of the likes of NEPOMUK, which broke down under its own weight? And, perhaps most importantly, how could it be presented to users in a non-technical and compelling way, without a need for ID3-ish extensive manual metadata input or questionable cloud metadata sharing practices?
Further discussing these questions will be the subject of my next post. So stay tuned next Thursday for more on this project’s proposed solution to the longstanding OS resource access problem!