Here is the last episode of this year’s Big Summer Update series. While not technically released in Summer, a significant part of its editorial journey happened during the last third of August, so I think it roughly qualifies. In this post, we’ll be talking about some file management abstractions, why we have them, and what I would like to do with them. Fair warning : I’ll borrow quite a lot from Apple’s OS X, as I consider that they have done pretty well in that area.
Why do we have them ?
Mass storage media are, like pretty much every piece of modern computer hardware, relatively complex. When you’re lucky, you can view them as a continuous stream (or, rather, spiral) of data blocks, like floppy disks and CD-ROMs. When you’re not lucky, you meet the complex geometries of hard drives and the garbage-collected dirty blocks of flash memory (which cannot be rewritten without being erased first). In any case, careful driver programming is needed to abstract implementation details away while keeping good performance.
Even for the simplest mass storage medium, a continuous stream of rewritable blocks, direct manipulation of the storage medium by the user becomes cumbersome as soon as more than one piece of content is stored. Imagine, as an example, that you have two text documents A and B on the mass storage device. Since we do not have a file abstraction, the user separates these different contents by remembering the position and size of each of them in the stream of blocks. At this point, it’s already pretty easy to see that this won’t scale well to a large number of independent contents. But problems don’t stop there. Each time a piece of content is modified, its size will probably change. This means that the user has to keep his knowledge of said size up to date. If a piece of content grows too much and starts to “hit” the next one, the user has two choices : either relocate it to another place where there’s more room (which will change its position), or split it into several smaller fragments (which requires explicit software support).
At some point, our user will give up on trying to keep all that in mind, and will take a piece of paper, give a name to each content, and keep track of its position and size on the paper. There we are : these (Name; physical structure) tuples are nothing but the crudest file abstraction possible. And here comes the answer to the “why files ?” question : files are the simplest (and thus fastest and easiest to implement) abstraction that can be used to refer to independent content on a storage device using human-readable names.
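To make the pen-and-paper bookkeeping above a bit more concrete, here is a minimal Python sketch of it. Everything in it is an assumption made for illustration : the `Extent` and `NameTable` names are hypothetical, the storage medium is modelled as a plain `bytearray`, and relocation simply appends to the end without ever reclaiming the old space.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """Where one piece of content lives in the stream of blocks."""
    position: int  # offset of the first byte on the medium
    size: int      # length of the content in bytes


class NameTable:
    """The 'piece of paper' : maps human-readable names to extents."""

    def __init__(self, capacity: int) -> None:
        self.medium = bytearray(capacity)  # hypothetical raw storage medium
        self.entries: dict[str, Extent] = {}
        self.next_free = 0                 # naive allocator : free space starts here

    def write(self, name: str, data: bytes) -> None:
        """Store content under a name, relocating it when it has outgrown its place."""
        old = self.entries.get(name)
        if old is not None and len(data) <= old.size:
            position = old.position        # the new content still fits in place
        else:
            position = self.next_free      # relocate (the old space is simply lost)
            if position + len(data) > len(self.medium):
                raise OSError("medium is full")
            self.next_free += len(data)
        self.medium[position:position + len(data)] = data
        self.entries[name] = Extent(position, len(data))

    def read(self, name: str) -> bytes:
        """Fetch content by name, without the caller knowing where it is stored."""
        extent = self.entries[name]
        return bytes(self.medium[extent.position:extent.position + extent.size])
```

With this, a user can call `write("A", ...)` and `read("A")` without ever knowing that A silently moved when it grew into B. A real file system would also have to deal with fragmentation and with reclaiming freed space, which this sketch ignores on purpose.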
Now that we have equated files and content with a name, let’s see how users interact with content on their computer.
Due to the constraints of the memory hierarchy, when a user wants to manipulate a file, the basic workflow of the software he uses is to bring all or part of the file’s content into fast but volatile RAM and manipulate that copy. At any time, users may ask the software to update the contents of the mass storage device through the “save” command. That “save” wording is highly appropriate : every second that data spends in RAM, it’s in danger. One single software or hardware failure, and it’s dead. Which is why most experienced computer users have a strange form of Parkinson’s disease that makes them compulsively press Ctrl+S or Cmd+S every time they make a simple change to the content they’re interacting with.
The problem here is that the user is responsible for addressing a flaw of his hardware+software combination of choice : that he’s not directly interacting with his content, but with an extremely fragile copy. From the point of view of this user, an abstraction where content would feel like it’s manipulated directly, whatever happens internally, would be simpler, safer, and reassuring.
A growing amount of software acknowledges this. Office suites automatically save documents on a regular basis and prompt the user to recover the last autosave on startup. Mail clients automatically save mails currently being written as drafts. Web browsers restore opened tabs after a crash. Everyone does their part to reduce data loss, and I think that it’s a great thing.
The OS contribution which I’d like to bring here is a universal mechanism for “almost direct” file editing through automatic saving. Advantages would include less code to write, publicity for the feature among developers, possibility to centrally manage the auto-saving system through a unified interface, overall more homogeneous UI and concepts, and smart scheduling of auto-save disk commits to optimize some performance criteria (power management, throughput…).
I have also tried to investigate making software itself feel like it directly manipulates the file, in a variant of the memory-mapped file mechanism. However, it appeared to me that this was impractical for many use cases (especially when compressed file formats come into play), cumbersome to code, and in the end didn’t address the problem of developer cooperation being required (unless software commits are atomic, garbled content will be left on disk when the memory-mapped file is saved in the middle of an edit).
So here’s a more conservative auto-saving concept, seen from the point of view of software developers. When a program opens a file through the system API, it is offered the option to subscribe to a regular “Auto-Save” notification from file management services. Each time this notification is triggered, the program is asked to either save the current version of the previously opened file in a location designated by the OS, or notify the OS that no changes have been made since the last snapshot. When the file is closed, the notification is automatically disabled. This is all that application programs have to care about.
On the OS side, implementing a basic illusion of direct file editing would be done by firing a notification to subscribed software every few minutes, the exact timing depending on user configuration and some other performance criteria (power management, disk throughput improvement). Each time, the OS would ask the software to save a new copy of the currently edited file, then replace the old version with the new version, thus guaranteeing that no application crash can leave files in a garbled half-saved state.
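As a rough illustration of this notify-then-replace scheme, here is a small Python sketch. The `AutoSaveService` class and its `subscribe`, `unsubscribe` and `fire` methods are hypothetical names invented for this example, not an existing API. The sketch assumes that the program’s callback writes its content to the path it is handed and returns `False` when nothing has changed, and it relies on `os.replace`, which swaps the file in a single step on POSIX systems when both paths are on the same file system.

```python
import os
import tempfile
from typing import Callable

# Hypothetical callback contract : write the current content to the given path
# and return True, or return False if nothing changed since the last round.
SaveCallback = Callable[[str], bool]


class AutoSaveService:
    """Toy stand-in for the OS file management service."""

    def __init__(self) -> None:
        self.subscribers: dict[str, SaveCallback] = {}

    def subscribe(self, path: str, callback: SaveCallback) -> None:
        """Called when a program opens a file and opts in to auto-saving."""
        self.subscribers[path] = callback

    def unsubscribe(self, path: str) -> None:
        """Called when the file is closed."""
        self.subscribers.pop(path, None)

    def fire(self) -> int:
        """Run one auto-save round and return the number of files committed."""
        committed = 0
        for path, callback in list(self.subscribers.items()):
            directory = os.path.dirname(os.path.abspath(path))
            # The "location designated by the OS" : a temporary file next to the
            # target, so that the final replace stays on one file system.
            fd, tmp_path = tempfile.mkstemp(dir=directory)
            os.close(fd)
            try:
                if callback(tmp_path):
                    os.replace(tmp_path, path)  # old version swapped out in one step
                    committed += 1
            finally:
                # Clean up if nothing was committed or if the callback failed.
                if os.path.exists(tmp_path):
                    os.unlink(tmp_path)
        return committed
```

A real service would call `fire` from a timer whose period follows user configuration and power or throughput constraints. The point of the sketch is only that a crash in the middle of a callback leaves the previous version of the file untouched.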
There’s actually more we can do with this mechanism. As some will point out, auto-saving is a software implementation of power users’ compulsive Ctrl+S-ing, which itself has some issues. Imagine, as an example, that you’re coding some piece of software (A). Suddenly, you want to test an implementation change (B). The change turns out to be rubbish in the end, so you cancel it to go back to the initial state (C). I don’t know about you, but I do this kind of experimentation fairly often with my code. Now, imagine what happens if the code editor crashes or is closed for some reason between B and C : unless it was itself saved in some proprietary IDE file, the undo history is cleared, and there’s no easy way to revert to the A state.
The solution to this has been known by good programmers for a long time : anytime you are going to perform big changes to your code, take a CVS/SVN/Git/Bazaar snapshot. Through the wonder of file versioning, you may easily dive into the old versions of your code when something goes wrong, and revert changes that turned out not to be a good idea after all. The main potential problem with this system is the need for a big repository where all versions are stored : while for small files like code it is conceivable (although messy) to keep all versions around, for bigger files a boundary must be set in order to avoid filling up your hard drive.
Some examples of parameters that could be used to control snapshot storage would be :
- The age of the oldest snapshot. Unless a user has explicitly asked to keep them through a “save as” request, there is no need to keep around versions of a file which said user has forgotten about, and these can be garbage collected away.
- The storage space allocated to a file’s snapshots, as a multiplier to the size of the largest snapshot.
Given all that, the repetitive task of snapshotting can be left to the machine simply by turning autosaves into snapshots once in a while. During such an auto-save, the newly saved file would not overwrite the old one : the old file would be archived, while the new file would become the currently manipulated file. Aside from such automatic snapshotting, which stems from the assumption that software and users do not necessarily know what a major change is at the time they perform it, software and users could also explicitly signal that they have performed major changes, so that a new snapshot is taken (possibly with a user-defined name if it is user-created).
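The two retention parameters listed above (maximum age, and a storage budget expressed as a multiple of the largest snapshot) could be combined along the following lines. This is only a sketch under my own assumptions : the `Snapshot` record and the `prune` function are hypothetical, snapshots explicitly kept by the user are modelled with a `pinned` flag and do not count against the budget, and once the budget is exceeded every older unpinned snapshot is dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    timestamp: float      # creation time, in seconds
    size: int             # size in bytes
    pinned: bool = False  # explicitly kept by the user (e.g. "save as")


def prune(snapshots: list[Snapshot], now: float,
          max_age: float, space_multiplier: float) -> list[Snapshot]:
    """Return the snapshots to keep, newest first."""
    if not snapshots:
        return []
    # Storage budget : a multiple of the size of the largest snapshot.
    budget = space_multiplier * max(s.size for s in snapshots)
    kept: list[Snapshot] = []
    used = 0
    budget_exhausted = False
    for snap in sorted(snapshots, key=lambda s: s.timestamp, reverse=True):
        if snap.pinned:
            kept.append(snap)  # user-requested versions are never collected
            continue
        if now - snap.timestamp > max_age:
            continue           # too old : garbage collect
        if budget_exhausted or used + snap.size > budget:
            budget_exhausted = True
            continue           # out of space : drop this one and all older ones
        kept.append(snap)
        used += snap.size
    return kept
```

Other policies are of course possible, for instance thinning out old snapshots progressively instead of cutting them off at a hard limit.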
Okay, so that’s all that I plan to do with files. Next subject after them is of course their life companions, folders.
Why do we have them ?
The file is a powerful abstraction, one that is sufficient to abstract away any modern data storage device from users and applications alike behind the simple concept of independent pieces of content whose internal organization is the OS’ secret.
However, for the huge mass storage media of today, which can store thousands to millions of books, files are not enough. If you have thousands of books lying on the floor of your office, without any form of organization, it is nearly impossible to quickly find one in the middle, and the same goes for computer content on an HDD. Well, okay, I’m taking a bit of a shortcut there : if you precisely know the name of the file which you’re looking for, search engines may be used to retrieve the file. But this is a strong assumption. What about dyslexic people ? What about learning about unknown content on someone else’s borrowed hard drive ? What about finding the part of your own stuff which you have forgotten ? Finally, search would be a crazy resource hog if it were the only way to find a file on a device, which is why most modern indexed search systems are limited to the user’s home folder.
So there is a need for a primary means of organization. What is the primary method of organization for real-world objects ? Logically grouping stuff together, using shelves and such. And putting a name on the groups as soon as they become too many. So okay, let’s introduce the same concept of named groups for files : there you have folders !
Bundles : Putting several files inside of one
Application storage is an interesting example of a situation where the needs of end users and developers apparently mismatch. On one side, most end users would like to treat applications as a single independent object : you (double) click on it to open it, like a file. On the other side, actually storing a piece of software within the boundary of a single file would be a terrible idea from a developer’s point of view, something that’s a pain both to implement and to manage. Developers would really like their software to be stored within the boundaries of a folder hierarchy that they control, in order to allow for on-demand content loading, ease coding of a plug-in system, speed up compilations, etc.
In the end, what both would agree on in this particular context is something that acts as a file for the end user who doesn’t know or doesn’t care, but is actually a folder with an internal hierarchy on the inside. Such an object has been implemented in the past by Mac OS X, in what Apple calls Bundles. You take a folder, you put some metadata inside of it (describing icons, command to execute on a double-click, loading screens), you write a “.app” at the end of the folder’s name, and voilà, there’s your folder acting like a file for the end user ! When needed, power users can still see the internals of a bundle on their side through a context menu option.
I think that this “Bundle” concept is great, and would love to use it plus or minus a few implementation details. However, I also think that it can be taken a bit further…
Bundle <-> file conversion
First, disguising a folder as a file is nice, but in the end it remains a folder. For local operations it makes no difference, but anytime you meet a low-level protocol that is explicitly designed to handle files, like the one that’s used for mail attachments, you’ll run into problems. Sometimes, there’s a need to actually make a bunch of files in a folder hierarchy behave like, well, a single file, for all intents and purposes. The fastest method to achieve this goal is to concatenate files one after the other in a big archive file, and add a bit of filesystem metadata here and there to reflect the internal file and folder structure. This is pretty much what the popular “tar” UNIX program does.
Here, what the OS could do is make sure that when software explicitly treats a bundle as a file, by passing it through an API operation which only applies to files, a “concatenated bundle” version is silently created and used instead. Conversely, when such a concatenated bundle is subsequently opened, it is silently turned back into a folder. While not pretty and probably also quite inefficient, such transparent conversion between a folder (usable) and a concatenated file (transportable) form of bundles would make bundles a better, more predictable “almost file” entity for the end user, while also keeping things simple for the developer.
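To give an idea of what such a conversion involves, here is a minimal Python sketch built on the standard `tarfile` module, which produces the same kind of concatenated archive as the “tar” program. The `bundle_to_file` and `file_to_bundle` function names are hypothetical, and a real OS service would perform this behind the scenes rather than through explicit calls.

```python
import os
import tarfile


def bundle_to_file(bundle_dir: str, archive_path: str) -> None:
    """Concatenate a bundle folder into a single uncompressed tar file."""
    root = os.path.basename(os.path.normpath(bundle_dir))
    with tarfile.open(archive_path, "w") as archive:
        # arcname keeps only the bundle's own name, not its absolute path.
        archive.add(bundle_dir, arcname=root)


def file_to_bundle(archive_path: str, parent_dir: str) -> str:
    """Turn a concatenated bundle back into a folder and return its path."""
    with tarfile.open(archive_path, "r") as archive:
        # Assumes the archive was produced by bundle_to_file, so that the
        # first member is the bundle's root folder.
        root = archive.getnames()[0].split("/")[0]
        # No safety checks here : archives from untrusted sources would need
        # their member paths validated before extraction.
        archive.extractall(parent_dir)
    return os.path.join(parent_dir, root)
```

Sending a bundle as a mail attachment would then amount to calling `bundle_to_file` on the sender’s side and `file_to_bundle` on the receiver’s side, with neither user seeing the intermediate archive.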
Next, once the OS facilities for transparent folder <-> file conversions are there, why should we limit ourselves to concatenation as the method to actually perform this conversion ? While certainly the fastest, it is not necessarily the most space-efficient. Imagine, as an example, a developer who wants to distribute a little piece of freeware on the Internet. Speed is not so important to him, but he certainly wants the thing to be small. So what he would like is a compressed variant of his bundle, using some variant of LZMA or whatever space-efficient compression algorithm is fancy at that time. Provided that the OS knows about the compression algorithm being used (which the developer should probably check), silently uncompressing LZMA bundles like concatenated ones would be a nice touch, and would only require very little extra code.
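How little extra code this takes can be illustrated with Python’s standard `tarfile` module, where switching from plain concatenation to an LZMA-compressed archive is a matter of changing the mode string from `"w"` to `"w:xz"`. In this sketch the `pack` and `unpack` helpers are hypothetical names, and the bundle is modelled as an in-memory dictionary of relative paths to contents in order to keep the example self-contained.

```python
import io
import tarfile


def pack(files: dict[str, bytes], mode: str) -> bytes:
    """Pack an in-memory bundle using the given tarfile mode ("w" or "w:xz")."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        for name, content in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def unpack(blob: bytes) -> dict[str, bytes]:
    """Open either form : mode "r:*" lets tarfile detect the compression itself."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }
```

The reading side does not need to be told which variant it received, which is the property that would let the OS silently open both concatenated and compressed bundles.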
Going beyond application storage
Bundles, in the general sense of “files” with an internal folder hierarchy, could have other uses than local application storage.
First, modern computers can do a lot, and modern file formats reflect this by being very complex. More often than not, it makes sense to store an internal file hierarchy inside of a document. For two well-known examples, office file formats do it, some in a proprietary way and some simply by ZIPping up a bunch of folders, and every multi-layered image file format (PSD, XCF…) does it too in its own way. So, what about a standard system abstraction for that instead of reinventing the proprietary wheel every time a new file format is needed ?
Also, I have talked earlier about the concept of automatically saving file snapshots. A question that has to be asked at some point is : where should said snapshots be stored ? The answer, in fact, depends on what your priorities are. If you want snapshots to have a simple implementation and to be easily transferred from one computer running this OS to another, using a special kind of bundle to store them would be a simple way to achieve that. If you want other OSs to easily deal with your files, you’ll probably prefer to store snapshots somewhere in a hidden part of the hard drive which only your OS is supposed to see. I have not yet settled on what my priorities are on those matters, I am just showing another possible application of folders behaving as files.
So now we have discussed file and folder handling, and I’d like to discuss a third part of file management which in my opinion is highly important : the default folder organization followed by the OS.
Common directory structure
So far, we’ve been considering the two basic building blocks of a file system, the file and the folder, from the point of view of how software and users deal with them. However, on a fresh OS install, the file system does not come blank. The OS has to provide a standard organization to access external drives and other partitions, home user directories, system files, software, and whatever else the system comes bundled with and is supposed to be extended with.
According to this Wikipedia page, the conventional name for this standard organization is “Common directory structure”. To make sentences more readable, I will refer to it as a “directory structure” only in the remainder of this article.
Now, I’d like to have a look at the directory structure of used installs of Windows and Linux. It is a deliberate choice to deal with those instead of the theoretical directory structure devised by the OS manufacturer, as seen on a minimalist fresh install from an unaltered retail version. It shows what users will actually deal with.
Directory structure of Windows 7
Like its DOS predecessor, Windows exposes storage partitions as parallel universes named using a drive letter (A:, C:, etc…). As for anything NT-related, things are a bit more complicated under the hood, but that’s the interface which most users and developers will deal with. For historical reasons, the A and B letters are reserved for floppy disks, meaning that on most modern computers drive naming will begin with C, which is the system partition.
Here follows the root of the system drive on my Windows install. This is a Windows 7 Home Premium 64-bit install that’s about one year old and has been treated with no specific care (e.g. software is installed with standard parameters and uninstalled using the add/remove software part of the Control Panel).
- ASUS.SYS : Remains of the installer for the “ExpressGate” secondary OS provided by the manufacturer
- $AVG : Stuff left there by my antivirus
- Boot : Stuff related to system boot, also includes a copy of MemTest, probably here as a secondary boot option
- Config.Msi : Origin and purpose unknown
- Cygwin : Place where Cygwin, a UNIX-like environment for Windows, is installed
- Documents and Settings : Link to the “Users” folder, exists for compatibility with older releases of Windows
- eSupport : More manufacturer stuff, this time it includes manuals, drivers, and other stuff
- Intel : Stuff left there by the Intel GPU driver installer. Contains nothing of value, just links and installer logs.
- lazarus : Place where Lazarus, a Delphi-like IDE, is installed.
- MicroProse : Place where Worms Armageddon, an admittedly old game, is installed.
- MinGW : Place where MinGW, a port of the GNU development toolchain (gcc, binutils, etc…), is installed.
- NVIDIA : Somewhere in the depths of this deep folder hierarchy, a copy of the installer for the NVIDIA GPU driver can be found.
- Octave : Place where Octave, a MATLAB clone, is installed
- PerfLogs : A large amount of information and statistics about my computer, spread across both XML files and a human-readable HTML version. Took me some time to figure this one out : I once ran a standard Windows tool which provides you with lots of interesting information about your system. Apparently, it silently saved all this information behind my back and never deleted it.
- ProgramData : Place where system-wide program data, caches, and settings should be stored
- Program Files : Place where all programs should be installed. Also includes the following folders…
  - Common Files : Some files which are shared by several programs, placed there through unknown rules
  - MSBuild : No actual software, but a piece of the system software called “Windows Workflow Foundation”
  - Reference Assemblies : A part of the .Net Framework installation from the folders and DLL names
  - Uninstall Information : An empty folder
  - Windows NT : Some software bundled with the system apparently randomly ended up there
  - Windows Sidebars : Contains no actual software, but only Sidebar gadgets shared by all users
- Program Files (x86) : Place where 32-bit programs, which are the vast majority of Windows programs available today, actually end up. Also includes the following folders…
  - Common Files
  - Downloaded Installations : Contains a single installer, which from the folder name apparently ended up there because it was downloaded from the Internet in a special way
  - InstallShield Installation Information : Contains actual software uninstall information
  - Microsoft.NET : No actual software, only a random data file
  - MSXML 4.0 : Contains an EULA. Nothing else.
  - Reference Assemblies
  - Temp : A well-named empty folder
  - Uninstall Information : Still as empty as before
  - Windows NT
  - Windows Portable Devices : Contains a single DLL, bearing the cryptic name of “sqmapi”
- PyGrenouille : Place where PyGrenouille, a software collecting data for the internet connection quality testing network grenouille.com, is installed
- Python26 : Place where the Python 2.6 interpreter is installed
- Recovery : Contains files used to recover lightly wrecked Windows installations
- $Recycle.Bin : Contains files with cryptic names, but I’m pretty sure this is something like a system-wide trash
- System Volume Information : Contains files used by Windows’ System Restore feature, another way to recover lightly wrecked Windows installations
- Temp : An empty folder
- tmp : Contains lots of Blender data files, sounds like some sort of cache used during the rendering by Blender
- Users : Contains one folder per user, where user-specific data is stored. Also contains the special users “Default” and “Public”. Default is the user profile which is used when a new user is created, and what’s inside “Public” is shared by all users.
- Westwood : A remain from an old install of Tiberian Sun, only a few random files left
- Windows : Where all “core” system files are supposed to be stored, more on its content later
The root of the system folder also includes the following files…
- AdobeReader.log : A mysterious file whose contents are “Adobe Reader 9.1 MUI
Build Date for Win7 : 2009/06/15”
- aqua_bitmap.cpp : A C++ snippet that couldn’t compile on its own, containing the definition of a small bitmap
- bootmgr : Stuff used for system boot
- BOOTSECT.BAK : Backup of the system partition’s boot sector
- devlist.txt : An extensive list of detected devices on the various system buses and their driver’s status
- Finish.log : Content is “Finish”. Mystery…
- hiberfil.sys : File where Windows moves the RAM’s contents when the system is put in hibernation mode
- if.log : A huge heap of software installation logs, which includes the mysterious “Downloaded Installations” mentioned above. Possibly a log left by the manufacturer when he installed the bundled software on the Windows system
- inject.log.txt : Another huge installation log, but which seems to work at a lower level (references DISM and WIM images). Perhaps a Windows installation log.
- N61JV.BIN : If I were to guess, this is probably a copy of the BIOS of my laptop. Now, what exactly it is doing here is a mystery…
- N61Jv_WIN7.30 : Contents are “WIN7 Driver_CD 2.91”, I am not amused
- OFFICE2007_L.TXT : “Office Pro 2007 Hybrid1
Build Date for Win7 : 2009/06/12”
- pagefile.sys : This file is used by Windows for swapping
- Pass.txt : Another mystery… “Pass date : 04-12-2010
ASUS.SWM File size Low WORD : F3BE669A
ASUS2.SWM File size Low WORD : F3BD38C3
ASUS3.SWM File size Low WORD : ED79F4BC”
- Patch_Win7.log : An outdated list of Windows patch names, probably left here by the manufacturer
- RECOVERY.DAT : A scary name for a grossly useless file, contents are “N61Jv”
- RHDsetup.log : The installation logs of a Realtek HD Audio driver
- setup.log : Another setup log left around by the manufacturer
- store.log : “Store”
- SumHidd.txt : Origin and purpose unknown
- SumOS.txt : Origin and purpose unknown
- v82.txt : “TOOL CD Version V8.2 B02”
An immediate conclusion when looking at the organization is that there is none. Or, more precisely, that the Windows teams attempted to give the thing a structure, but that third-party programs, and even Windows itself, do not respect it much. This pollution probably results from laziness on the part of developers (it is faster to hard-code C:\<file name> as an installation path than to look up an MSDN doc on the subject) and from legacy development practices dating back to the DOS era, when everything could fit in the root of the system drive.
The reason why all this still has an impact is that Windows is an operating system with a long history which has chosen compatibility over cleanness early on. Even as the original design of DOS turned out to be deeply flawed in the context of modern personal computing, Microsoft chose to keep pretty much every single user-visible part anyway, only removing stuff at an extremely slow pace. As a notable consequence, they kept around the “installer” concept, which allows software to do anything to the operating system as part of its installation, a major offender as far as hard drive pollution is concerned (and also a nice malware introduction vector).
For even worse pollution, I could have described the contents of C:\Windows or C:\Users\<name>\AppData, whose organization is crazy beyond repair, but that would have taken too much time. Also, although I personally disagree with this point of view, one may argue that pollution within these folders is only a problem for power users, and as such is unimportant.
Directory structure of Fedora 14
Linux, like every member of the UNIX family, treats the system partition in a special way compared to other storage media. Said partition is represented as “/”, and is the root of the virtual file system (VFS). Other partitions are accessed through a mechanism called “mounting”, where their content appears in a user-specified folder somewhere in the VFS hierarchy. While a godsend for advanced system administration tasks, manual drive mounting is suitable neither for beginners nor for everyday use (“Where has my USB pen drive just gone ?”), so in practice modern desktop Linux distributions generally use a scheme where all available internal and external partitions appear in the GUI without actually existing in the filesystem, and are automatically mounted on a generic directory when “opened”.
If one opens the root of the VFS on my Fedora 14 x64 install, which has been treated in the same way as the Windows one (default installation procedures, no special cleaning care), one can find the following folders :
- bin : Contains the main executables of some programs, or symlinks to them, as a flat heap of files. Officially, programs in this folder provide commands which are necessary when booting the kernel in single-user mode (ie without multi-user abstractions). However, what’s “necessary” is obviously a relatively blurry notion.
- boot : Stuff related to system boot, kernel images and bootloader files mostly.
- cgroup : An empty folder. Probably related to the “cgroups” process grouping feature of the Linux kernel in an unknown way.
- dev : Contains virtual files associated with physical or virtual peripherals, in accordance with the UNIX design principle that everything which can be abstracted as a file should be abstracted as a file. Mostly a flat heap like bin, but some attempts at folder-based organization have been made for some arbitrary classes of devices.
- etc : Contains system-wide configuration files, arranged in an anarchic fashion.
- home : Contains each user’s personal folder. Aside from personal files, these folders contain a huge anarchic heap of hidden files and folders. Akin to Microsoft’s AppData folder, these together form the per-user configuration and program data.
- lib : Contains a flat heap of 32-bit (non-native) shared libraries, which officially must be needed for programs in /bin to run, along with a link to the “cpp” binary for some obscure reason. Also includes the following folders :
  - alsa : Some configuration files for ALSA, the standard sound management component of the Linux kernel.
  - crda : Some files from another kernel component, which ensures that wireless communications follow legal constraints (e.g. in terms of output power, frequencies, bandwidth…).
  - firmware : Binary firmware for a wide range of hardware, used by the Linux kernel.
  - i686 : Contains the “nosegneg” version of C libraries, which accesses memory in a special way that makes Linux run faster above the Xen hypervisor.
  - kbd : Files used by the kernel for text I/O. Contains console fonts, mappings of keyboard scancodes to a character set (aka “keymaps”), translation tables between different character sets…
  - modules : Contains kernel modules and the associated configuration files, for each installed kernel version. The attentive reader will have noticed that this is 64-bit code.
  - rtkaio : Library files used to access the Linux kernel’s asynchronous I/O facilities, which get their own folder for mysterious reasons…
  - security : An empty folder. I’m trying very hard not to see a symbolic meaning there.
  - systemd : A heap of config files and folders from systemd, one of the many replacements for the “init” program from System V UNIX, which is the first process to run after the kernel is ready.
  - terminfo : Descriptions of terminals, i.e. text I/O devices.
  - udev : Lair of udev, Linux’s device manager. It listens to hardware hotplug events from the kernel and, according to a set of rules, runs further hardware configuration routines and creates or deletes content in /dev.
  - upstart : Binaries and links from upstart, an older init replacement. Apparently, the story is that Fedora 14 was scheduled to switch to systemd, but stuck with upstart due to last-minute issues.
- lib64 : Contains a flat heap of 64-bit (native) shared libraries, plus the following folders :
  - dbus-1 : Contains the file “dbus-daemon-launch-helper”, a part of a larger system service called “D-Bus” which implements some high-level IPC primitives on top of the Linux kernel.
  - device-mapper : Some of the libraries allowing software to access the “device mapper” kernel service are stored here, for unknown reasons.
  - multipath : Libraries controlling the system’s abstraction of multipath connectivity, that is, when several hardware paths lead to a single device.
  - rsyslog : Libraries allowing use of the syslog log message standard (and then some more).
  - rtkaio : Yes, the same as before. Same goal as before. Guess it’s the 64-bit version.
  - security : Lots of libraries (and a config file) controlling PAM, for Pluggable Authentication Modules, a system for abstracting authentication mechanisms away. (Program A can rely on Program B for authentication without relying on a specific method or implementation of authentication)
  - tls : This folder is empty.
  - xtables : Lots of libraries controlling the internal firewall architecture of the Linux kernel.
- lost+found : In a situation where the filesystem becomes garbled, e.g. as a result of a power outage during a disk write, a full FS check can locate file fragments and put them here, in the hope that the user can recover some data from them.
- media : This is where removable drives and other unmounted partitions accessed from the GUI are automatically mounted.
- mnt : This is where partitions used to be temporarily mounted in a distant past. Empty, probably kept around because of nostalgia value and compatibility with old system administration tools.
- opt : Contains some programs that describe themselves as optional (whatever that means)
- proc : Intricate heap of virtual files within a partially organized folder hierarchy, similar to /dev. Files in this folder together describe some kernel and process state in a text form.
- root : Home folder of the system administrator, or root user.
- sbin : A unidimensional heap of programs, very similar to /bin in structure. According to the Filesystem Hierarchy Standard, the difference between both is that /sbin is for “essential system binaries” whereas /bin is for “essential command binaries”.
- selinux : A number of files from SELinux (Security-Enhanced Linux), a set of patches to the Linux kernel and user-space tools which aims at improving Linux security, notably by introducing per-binary resource access restrictions. While the intent is good, the current implementation suffers from its nonstandard status, breaking some software which is not built for it and does not expect its presence.
- srv : This directory has a blurry purpose (“Site-specific data which is served by the system”, to quote the FHS), and is empty anyway.
- sys : Provides information on the running system through text files and allows configuration of it. Very similar to /proc, but more limited in purpose (everything comes from the kernel, no information from other processes), and has a fully organized folder hierarchy that makes it much easier to navigate.
- tmp : Temporary files, typically flushed on system reboot. Preferred to mallocated RAM by some programs.
- usr : Defined as “read-only user data”. Includes the following folders :
- bin : Listing the contents of this folder generally takes a few seconds. It contains most of the binaries of the operating system, in a unidimensional heap similar to /bin. And two folders, gda_trml2html and gda_trml2pdf, home of two Python applications which are made of several non-independent files.
- etc : According to the definition of usr above, this folder has no reason to exist. And indeed, it is empty.
- games : For some reason, some games consider that they have a special place in the Linux application ecosystem and must get their own folder. Maybe it’s that since games typically require lots of private data (levels, sprites, 3D models, sound…), they do not work well with the standard folder hierarchy of UNIX systems, which is designed for lots of small binaries which rely on shared libraries and do one simple task.
- include : Header files of some libraries, used in software development and compilation. Some libraries, typically large ones, store their headers in a folder, whereas others directly put them in the include directory, with the risks of namespace collision that this brings.
- lib : Cf /lib, but bigger. A lot more folders this time, so I can’t cover all of them in this review, but the existence of some definitely has to be questioned. Especially the empty “games” and “java-x.y.z” ones.
- lib64 : Another folder where listing contents will temporarily hang your file explorer and make your fan go wild. Cf /lib64 globally, lots of folders here too.
- libexec : Coming straight from the BSD world, contains binaries which are supposed to be used as a library (ie not called directly by the user). And more of them in subfolders.
- local : Yet another remake of the folder hierarchy, akin to /usr, but this time for “local data, specific to this host”. I won’t bother detailing its contents much, it’s yet again bin, etc, games, lib…
- sbin : “Non-essential system binaries, e.g. daemons for various network services”, according to the FHS.
- share : “Architecture-independent (shared) data”.
- src : When you install the source code of a program, typically the Linux kernel, it will end up there.
- tmp : Link to /var/tmp
- usr : Probably the remains of a failing installer, contains a local/lib/security folder hierarchy which leads you to some library files controlling PAM.
- var : Location where “files whose content are expected to continually change during normal operation of the system” are stored. Contains the following sub-folders :
- account : Contains an enigmatic empty “pacct” file.
- cache : Cached data, that is stuff which is here to speed up some operations but can be safely deleted if needed.
- db : Contains a db/sudo folder hierarchy. Opening the last folder requires root privileges, but there’s nothing inside.
- empty : This folder is not empty, but contains an empty sshd folder, which again requires root access to be opened.
- games : Contains system-wide data files of some games : logs, records, list of dead characters for ghost generation purposes…
- gdm : Empty folder.
- lib : According to the FHS, this is “State information, persistent data modified by programs as they run”. Guess that the aim is to provide an alternative to malloc which can survive a program crash for state recovery purposes.
- lock : “Files keeping track of resources currently in use”. Overlap with the last folder ? How so ? :)
- log : Log files, place where some programs store their debug output so that people can read it in case something goes wrong
- mail : Mailboxes of users, used as a means of inter-user communication on some networks
- nis : Empty folder whose role will remain a mystery
- opt : Mysterious empty folder
- preserve : Apparently not so well preserved, since this folder is empty
- report : Empty folder
- run : Information about the running system’s state. I don’t quite see why this and /proc both exist, perhaps this is meant to survive a system crash whereas /proc is a RAM construct ?
- spool : Places where a process writes data that is to be handled by another process, as a crude form of IPC. Used for documents to be printed, incoming mail…
- tmp : Temporary data that is not to be deleted when the system reboots.
- www : Mysterious folder, looks like the remains of an Apache install that was only partially removed
- yp : Empty folder, perhaps dedicated to the storage of Youtube Poops
Also, there’s a “null” file in the root directory. No contents inside. Guess it was predictable.
Linux is an heir of UNIX, which had bet everything on textual data and filesystem-based OS abstraction. Everything was to be treated as textual data stored somewhere in the file system, whenever reasonably possible. Logically, the modern Linux desktop has inherited from its roots a very complex, yet partially organized and human readable file system structure. Also like UNIX, the Linux ecosystem attempts to base itself on a very wide array of tiny cooperative programs and libraries which each do simple tasks. These spread anarchically on a common namespace, and are supposed to share their architecture-independent data (icons, etc…) with each other.
While this design is beautiful in theory, an attempt to make it scale to the complexity of a modern desktop operating system results in what’s best described as an intricate mess. Folders get overly crowded, and can hardly be parsed by a human being. Unices rely too much on developer cooperation for filesystem structure preservation, like Windows, with a similar punishment : each developer who chooses not to follow the FHS, either because he does not know about it or because it doesn’t fit his use cases well, will add complexity to the overall file system, making it more and more crowded and mysterious. What’s more, a Linux system is based on a huge number of strongly interacting actors which regularly introduce breaking changes in new versions. This means that software frequently has to be rewritten and/or recompiled, or that compatibility kludges must be introduced to make it work, complicating the overall FS structure even further.
As a consequence of all this, the contents of a UNIX file system are not discoverable. Like with CLI commands, you have to know about what you’re looking for in order to find it, resulting in a user-hostile culture of endless boring documentation, often written in plain unformatted text for further injury.
Another aspect of the UNIX culture of everyone sharing everything with everyone else is that Linux distributions are strongly reliant on the existence of centralized package repositories and package management systems. Those introduce single points of failure in both OSs (a package manager database corruption is pretty much a game over) and the general ecosystem (if you want to make a malware spread, you only have to make sure it enters the repository). They also require considerable human and financial means to work, and typically cause new software releases to reach Linux distros very late in their life cycle. Finally, since Linux is just a kernel and package management, like everything else, is not a standardized feature, considerable effort duplication occurs as each Linux distribution has to re-package a given new software release in its own way.
All in all, I believe that the benefits of the UNIX approach to OS design do not outweigh its drastic cost in the modern era. And that it results in Linux, like Windows, becoming a mess in the end from the user’s point of view.
Well, now that I’ve mocked the existing stuff a bit, it’s time to take a more constructive point of view and attempt to design something that would work better, in the context of personal computing that I target with this project. When designing the directory structure presented below, I had the following goals in mind :
- Being comprehensible without a need for heavy documentation or lengthy experimentation
- Treating applications as isolated entities (as this is the model that best fits most desktop and mobile software). This notably involves attempting to avoid sharing of data and libraries between applications at all cost, as this behavior is seen as a major source of fragmentation and structural complexity whose benefits do not outweigh the cost.
- A clean concept surviving future evolution and constant attacks from poorly coded or intentionally offensive software
- Avoid having hundreds of files end up in a folder, so that users can quickly parse their content. Prefer hierarchization where it is a practical option.
- Do not prioritize multi-user operation. Personal computers are in essence relatively individual machines which rarely see more than 10 installed user accounts and 1 user logging in at a time. As time passes, they are becoming more and more individual, as can be observed with newer designs such as netbooks and tablets. Multi-user still has its place (e.g. for family desktops) and must be taken into account, but it is not a central feature anymore. It must not dictate the rest of the system’s design. One of the consequences of this is that regular users would be trusted significantly more than on other OSes as a default setting, and that the role of the root account as compared to a “normal” account would be restricted to that of user account management and access to all private user folders. In an acknowledgement of personal computing’s specific priorities, security is achieved through application software sandboxing more than user privilege restriction : rogue applications, not users, are the main enemy.
- In the same vein, mounting is a powerful power user feature, but it is highly complex in essence and its existence must not be imposed on unskilled users who don’t need it, and the OS should not be built around it like Unices are. The use case where mounting is fully automatic, with drive partitions coming in and out as devices are plugged in and removed, must receive special attention.
To meet these goals, I suggest the following design : like on Unices, there is a root directory, also called / for the sake of consistency. Within this root directory, one can find the following folders :
- Applications : Contains installed application software, on a one-bundle-per-software basis. Experience of Windows and Mac OS X shows that the most straightforward “flat” storage fails here, so hierarchization must be envisioned, like on Linux DEs’ main menus. Contrary to what happens on Linux, however, main menu structure would strictly follow the internal hierarchy of the Applications folder, avoiding the need for two separate databases which always end up out of sync. However, this means that the hierarchy of the Applications folder is of extreme importance, and must be decided with extreme care by the time it is rolled out (which will come much later in the development of this OS).
- Storage Media : Main content is all accessible partitions but the system one, designated by their label name. Partitions are internally managed through some unique identifier which only depends on user-inaccessible information such as the drive’s serial number, similar to a Linux UUID. If we assume for a second that we use said UUID scheme (it is not decided yet), the implemented VFS could present an interface where Storage Media has a “By UUID” subfolder where the actual drives are mounted, while the “drives” of Storage Media are nothing but dynamically updated hard links to this folder’s contents.
- System : This folder contains the core OS files, whose boundary is defined as anything that is needed to achieve full compatibility with application software and underlying computer hardware, and to provide the standard interface functionalities to the user. The internal structure of this folder is not defined yet and is subject to change after definition anyway, but subfolders such as “Boot”, “Drivers”, “Shared Libs”, or “Services” could be expected. The GUI should probably display a strong warning when users attempt to tamper with the contents of this folder, with an option to cancel the action in progress and one to never show the warning again, for OS hackers and other power users.
- Users : This is the place where each user’s private files are stored, in a fashion reminiscent of Unices’ /home. Contrary to what happens on Unices, though, as a default setting, non-root users can go everywhere on the drive but in other users’ folders, even though such permissions can be restricted by root. Another crucial difference is that contrary to what happens in UNIX-inspired designs, this folder is purely the user’s property. In technical terms, this means no hidden heap of configuration files (I will explain where these go in a minute), and very restricted access of applications to that folder (such access requires explicit user permission, such as handing the file’s name in a command line parameter, double-clicking the file in the GUI after having associated the application with it, or asking the user to designate the file through a standard system dialog). As in other systems, there is an “Everyone”/“Shared”/“Global” folder, accessible by all users, which a user can use to share files with other users of the computer.
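The “By UUID” idea from the Storage Media entry above can be pictured with a small in-memory model. This is only a sketch under my own assumptions : the class and method names are hypothetical, the UUID strings are made up, and the way two partitions with the same label are told apart (a numeric suffix) is something the design above does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class StorageMedia:
    """Toy model of the "Storage Media" folder (all names are hypothetical)."""

    # Maps the internal unique identifier of each mounted partition to its label.
    by_uuid: dict = field(default_factory=dict)

    def plug(self, uuid, label):
        """A partition appears: it is mounted under "By UUID/<uuid>"."""
        self.by_uuid[uuid] = label

    def unplug(self, uuid):
        """A partition disappears: its mount point and its link both vanish."""
        self.by_uuid.pop(uuid, None)

    def listing(self):
        """Return the user-visible view: display name -> link target."""
        view = {}
        for uuid in sorted(self.by_uuid):
            label = self.by_uuid[uuid]
            name = label
            suffix = 2
            # Assumption: label clashes are resolved with a numeric suffix.
            while name in view:
                name = f"{label} ({suffix})"
                suffix += 1
            view[name] = f"By UUID/{uuid}"
        return view
```

Since the label-based view is recomputed from the UUID table, plugging or removing a drive never leaves a stale entry behind, which is the point of keeping the UUID as the only real identity of a partition.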
At this point, to better picture how this design would work, I think it is best to describe how the application bundle system would work in more detail. So we have our application bundle, with a “.bundle” extension or whatever else will be used. This bundle is installed in the “Applications” folder the first time the user attempts to run it, as the system detects that it has no entry for this application in its database and creates one in a standardized setup procedure. Conversely, application removal is done by removing the bundle from the Applications folder. It is the system’s responsibility to discard all internal traces of the application that it has created behind the user’s back when this happens.
Let’s have a closer look at what a bundle’s contents could be now. This is not a definitive description, just one which I feel comfortable with right now and would like to try if no one else sees flaws with this design :
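To make the install-on-first-run and remove-by-deletion behavior a bit more concrete, here is a minimal sketch of what the system-side database could look like. Everything in it is hypothetical : the BundleRegistry name, the fields stored per application and the sync_with_folder method are illustrations of the idea, not a planned API.

```python
class BundleRegistry:
    """Toy model of the system's application database (hypothetical names)."""

    def __init__(self):
        # Bundle name -> records the system keeps "behind the user's back".
        self.entries = {}

    def run(self, bundle_name):
        """Launch a bundle; return True if this launch triggered the setup."""
        first_run = bundle_name not in self.entries
        if first_run:
            # Standardized setup procedure: create the database entry.
            self.entries[bundle_name] = {"launch_count": 0}
        self.entries[bundle_name]["launch_count"] += 1
        return first_run

    def sync_with_folder(self, bundles_present):
        """Discard records of bundles that are no longer in Applications."""
        removed = [name for name in self.entries if name not in bundles_present]
        for name in removed:
            del self.entries[name]
        return sorted(removed)
```

In this sketch the Applications folder is the source of truth : the database is only a cache of it, so deleting a bundle is enough to uninstall the application once the system notices the change.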
- In a “System”/“Descriptors”/“Meta” subfolder, one can find text files and resource files which provide the OS with the information it needs to know about the application. This includes an explanation that this bundle contains an application, its icon (because generic icons are dull…), perhaps a description for use within GUI, the list of special security privileges which the application requires to run (if any), etc.
- To enforce use of the central system updater (as opposed to annoying per-application updaters) and reduce exploit impact, most of the applications’ files within the bundle are read-only as a default, and writing to them requires a special system permission. However, it is obvious that applications will also need to keep some kind of state on disk across launches (configuration files, caches, history, game high scores…). To this end, a “State” sub-folder, in which unprivileged applications may write data (but not create executables) is available. This folder must be initially empty, and removing its content must result in the application going back to its post-installation state, modulo the changes brought by updates.
The way I suggest to use this folder is as follows : applications normally do not directly manipulate the contents of the “State” folder, but can instead access through system abstractions a “system-wide state folder” and a “user-specific state folder”. When system-wide state is written, the system creates a “Global”/“Shared” sub folder inside of “State” and lets software write the data into it. When user-specific state is written, the system creates a folder bearing the current user’s name inside of “State” and lets software write the data into it. Overall, the “State” folder follows the structure of /Users and its access permissions, but it does not have to keep in sync with the current user database when new users are created, only when users are deleted.
This means that if an application is exploited, the only data which it can leak is the data which it has itself written, and not the full contents of a user’s home folder as is the case with current multi-user security models. As a counterpart, backing up state data without backing up the software which generates it, as is currently easy to do on Linux or Windows NT, is more difficult. Notwithstanding, I believe that my side of the compromise is more interesting for use on individual machines.
- The application is free to choose whatever layout it wants for the rest of the bundle’s contents, even though some templates can be suggested for specific kinds of applications.
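The “State” folder rules described above boil down to two small path computations, sketched below. The function names, the choice of “Shared” as the name of the system-wide sub-folder and the example bundle path are assumptions made for illustration ; the post deliberately leaves these details open.

```python
from pathlib import PurePosixPath

APPLICATIONS = PurePosixPath("/Applications")
SHARED_NAME = "Shared"  # assumption: the post hesitates between "Global" and "Shared"


def state_dir(bundle, user=None):
    """Return the folder an application gets when it asks for its state folder.

    With no user, this is the system-wide state folder; otherwise it is the
    state folder bearing that user's name.
    """
    base = APPLICATIONS / bundle / "State"
    return base / (SHARED_NAME if user is None else user)


def can_write(bundle, target):
    """Unprivileged applications may only write below their own State folder."""
    state_root = APPLICATIONS / bundle / "State"
    target = PurePosixPath(target)
    if ".." in target.parts:
        # Refuse paths that try to climb back out of the State folder.
        return False
    return state_root in target.parents
```

For a hypothetical “Games/Chess.bundle”, state_dir("Games/Chess.bundle", "alice") yields /Applications/Games/Chess.bundle/State/alice, and can_write refuses both the bundle’s own read-only files and anything under /Users.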
Let’s sum this up.
Modern mass storage media hold lots of data, so abstractions must be used to organize content in a fashion that’s comfortable to the human brain. Two common abstractions are used for that purpose, files and folders. Files are used to separate different contents, folders are used to organize files hierarchically.
As mass storage devices are slow, files are generally copied to a chunk of dynamically allocated RAM and edited there. This is a problem because that mallocated RAM is vulnerable to both power failures and software crashes. Hence we want software to frequently save a file that’s opened in RAM back to disk. I suggest a system service takes care of that. An API abstraction could be a periodic “auto-save” notification, which asks software to save any currently opened file in a specific location of the mass storage device. This solution has the advantage of being open to future evolutions of the way the OS stores automatically saved snapshots. It also allows for the addition of another interesting feature, making automatic snapshots of files that users can use like a long-term and crash-proof undo history.
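As a rough illustration of the “auto-save” notification, here is a sketch of a service that periodically hands each registered application a location to save into. The AutoSaveService name, the callback signature and the /System/Snapshots location are all hypothetical ; the only idea taken from the text is that the OS, not the application, decides where snapshots go.

```python
class AutoSaveService:
    """Toy auto-save service (hypothetical API and snapshot location)."""

    def __init__(self, snapshot_root="/System/Snapshots"):
        self.snapshot_root = snapshot_root
        self.handlers = {}   # application name -> callback taking a location
        self.tick_count = 0

    def register(self, app_name, on_autosave):
        """An application subscribes to auto-save notifications."""
        self.handlers[app_name] = on_autosave

    def tick(self):
        """Called periodically by the system; notifies every application."""
        self.tick_count += 1
        locations = []
        for app_name, on_autosave in self.handlers.items():
            # A new location per tick keeps older snapshots around,
            # which is what makes a long-term undo history possible.
            location = f"{self.snapshot_root}/{app_name}/{self.tick_count}"
            on_autosave(location)
            locations.append(location)
        return locations
```

Because applications only ever see the location they are given, the snapshot layout can change later without breaking them.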
Folders are normally an innocent user-visible filesystem organization unit. However, one can also imagine creating a special kind of folder, which normally behaves like a file for software and users and only reveals its folder nature to the OS and its curious power users. Such a “black box” has a variety of applications, most important of which is software storage, as it provides a simple abstraction of software sandboxing at the filesystem level.
The virtual file system of an OS (and the underlying system drive physical file system on which it is based) is not provided to the user as a massive heap of files : it is supposed to provide a standard folder hierarchy in which information is stored in a clearly organized way. However, if we take real world installs of current desktop OSs such as Windows 7 or Fedora Linux 14, we can see that this standard hierarchy is a bit complex to start with, and suffers from quite a lot of deviations from the original model as the OS install ages. I believe this latter problem is due to current OSs letting software spread too much on the hard drive, a behavior that I hope can be suppressed by restricting applications to operate in a “black box” folder such as the one described above, unless under the explicit permission of the user. However, I guess that only real-world experiments will tell.
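The “black box” behavior can be sketched as a directory listing that stops at bundle boundaries unless asked otherwise. The nested-dict representation of the file system, the visible_entries name and the use of a “.bundle” suffix to recognize such folders are assumptions of this sketch.

```python
def visible_entries(tree, reveal_bundles=False, prefix=""):
    """List paths in a nested-dict file tree, treating bundles as opaque.

    Folders are dicts and files are None. A folder whose name ends in
    ".bundle" is listed like a file unless reveal_bundles is set.
    """
    entries = []
    for name in sorted(tree):
        child = tree[name]
        path = prefix + name
        is_folder = isinstance(child, dict)
        is_bundle = is_folder and name.endswith(".bundle")
        entries.append(path)
        if is_folder and (reveal_bundles or not is_bundle):
            entries.extend(visible_entries(child, reveal_bundles, path + "/"))
    return entries
```

Regular users and software would get the default view, where a bundle is a single entry ; the OS and curious power users would call it with reveal_bundles=True.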
In this article, I also describe a preliminary vision of what the common directory structure of my OS could be.
And this is where this post and series end, folks. Took me quite a lot of time to finally get this article out of the door, due to issues stated above. For the future, I have two article concepts drafted and lots of kernel code to write. I believe I’ll start with the latter one, and use the former ones as part of my regular blog updates, but only time will tell if I’m finally ready to go back to regular OSdeving again, or if I’ll have to stick with more irregular updates for now.