Secure file handling

Files are one of the most frequently used OS abstractions. They provide a convenient way for programs to produce long-lived output and to store persistent state between runs. From a user’s point of view, files also provide a standard, application-agnostic way to share any kind of data between computers, to be contrasted with the hoops one has to jump through on those ill-designed mobile operating systems that lack a user-visible file abstraction.

The power of files lies in the fact that, by using them, programs can store and retrieve named data without having to care about where and how it is stored. But such great power comes, as usual, with great responsibilities, which is why the abstraction can get pretty complex under the hood. Having discussed the UX of files earlier this year, I will now discuss the security implications of typical file manipulations.

The security of saving

Though some people like to create empty files for the fun of it, using utilities like UNIX’s touch, files are usually created as a result of applications saving data, that is, making a named copy, on a slower but more robust mass storage device such as a hard drive or an SSD, of some fragile piece of data that currently only lives in RAM.

Naming files can be as simple as giving them a unique number, called an inode number in UNIX parlance, and that is what most filesystems do at their core. User-visible textual names and other metadata are solely provided for developer and user convenience, as part of a scheme for organizing data that is not necessary in order to store it on and load it from the filesystem.
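
To make this concrete, here is a minimal POSIX sketch (my own illustration, not tied to any particular filesystem) showing that a file’s identity is its inode number, while textual names are just directory entries pointing at it: two hard links to the same file report the same inode.

```c
/* Minimal POSIX sketch: the file's "real" name is its inode number; the
 * textual paths are just directory entries pointing at it. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;

    FILE *f = fopen("demo.txt", "w");        /* create a file...          */
    if (!f) { perror("fopen"); return 1; }
    fputs("hello\n", f);
    fclose(f);

    if (link("demo.txt", "alias.txt") != 0)  /* ...and give it a second name */
        perror("link");

    if (stat("demo.txt", &st) == 0)
        printf("demo.txt  -> inode %llu\n", (unsigned long long)st.st_ino);
    if (stat("alias.txt", &st) == 0)
        printf("alias.txt -> inode %llu\n", (unsigned long long)st.st_ino);

    /* Both paths print the same inode number: the names are metadata,
     * the inode is the file. */
    unlink("alias.txt");
    unlink("demo.txt");
    return 0;
}
```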

Caching

Mass storage devices available to the general public today are orders of magnitude slower than volatile RAM, so relying on data stored on them is a common source of performance problems for applications. To mitigate this effect, most operating systems keep a copy of frequently used files in RAM, called the disk cache, and have applications access that fast copy instead of the slow real thing. The on-device version is then updated from the cache in the background, when the computer is not busy.

Like all forms of caching, the disk cache is bad news from a reliability point of view. Whenever two copies of a piece of data that should look the same exist, keeping them in sync raises the cache coherency problem, and that programming challenge is hard. The problem, in this case, is twofold:

  • Software developers are often unaware that the disk cache exists, so they assume that whenever a disk write is reported to be finished, the data is actually on disk.
  • For legacy reasons, in many OSs, the disk cache was added by changing the semantics of synchronous file writes, which previously did not return to application code until the write was actually committed.

In short, to implement a disk cache, most operating systems did a very bad thing: they changed the contract that OS system calls make with developers while keeping the illusion of compatibility. Cue decades of misunderstood programming tutorials giving young developers the illusion that disk writes are truly synchronous, and the ridiculous fsync() call (whose stdio cousin, fflush(), merely empties user-space buffers into that same cache), which breaks the transparency of the disk cache abstraction and asks for trouble by demanding that developers write two separate lines of code to perform one single common task.
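
To make the problem concrete, here is a minimal POSIX sketch of what a careful “save” routine has to look like today: write() typically returns as soon as the data has reached the disk cache, and a separate fsync() call is needed before the data can be considered durable. The function name and error handling are illustrative only.

```c
/* Minimal POSIX sketch of the "two lines of code for one task" problem:
 * write() usually returns as soon as the data is in the OS disk cache,
 * and a separate fsync() is needed before the data is actually durable. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int save_report(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return -1; }

    if (write(fd, text, strlen(text)) < 0) {  /* returns once it is cached */
        perror("write");
        close(fd);
        return -1;
    }

    if (fsync(fd) != 0) {                     /* only now is it on disk    */
        perror("fsync");
        close(fd);
        return -1;
    }
    return close(fd);
}
```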

Besides race conditions, disk caching also diminishes the failure resistance of computer systems. It drastically increases the amount of trouble that can arise when software crashes, when disks are removed from a computer without prior notice from the user, or when the entire OS is forcibly brought down by a power failure. Secure software and OSs must ensure data availability by making sure that no software bug or crash can leave a mass storage device in an inconsistent state, and current implementations of disk caching that masquerade cached writes as synchronous writes make this much harder. This is why, for example, OpenBSD has this problematic feature disabled by default.

Finally, a more minor concern related to disk caching, which is still worth pointing out, is that any data stored there is obviously very sensitive from a security point of view. For performance reasons, disk caching occurs before files are encrypted, so the disk cache contains a partial clear-text copy of potentially sensitive disk data and must thus be kept safe from the prying eyes of unauthorized processes. Current disk caching implementations already do a pretty good job at this, however.

Resilience to system failure

As mentioned above, secure operating systems must guarantee the availability of data to the fullest feasible extent. In the event of unforeseen failure such as software crashes or hardware breakdown, the system, and the abstraction it provides to applications, must ensure that the largest possible subset of live data is kept intact.

In the case of software crashes, a common cause of data loss is a broken user interface metaphor that has been with us for a couple of decades: the “save” metaphor. It exposes the inner workings of software by making us users aware that software can only work efficiently on in-RAM copies of data, and by holding us responsible for regularly requesting that those copies be committed to safer permanent storage. Like any UI metaphor that forces users to continuously do something, or else bad things will happen, this is a recipe for disaster.

A very promising alternative was proposed by Apple in OS X around version 10.7 “Lion”. The new UI metaphor gives users the illusion that software works directly on the permanent copy of data, while the operating system periodically asks software to commit unsaved data to disk in order to keep that illusion realistic. Unfortunately, this never really worked out in OS X, because one does not simply change basic interface metaphors in an existing operating system and expect users to cope with it well. But I think the idea was great and deserves more attention from newer operating systems that are not held back by legacy user habits.

Besides software failure, another cause of data loss to account for is hardware failure. Modern mass storage devices are little wonders of precision manufacturing which would require continued maintenance to keep working indefinitely, receive none in practice, and are thus doomed to fail eventually. Since we cannot prevent them from failing, we need to ensure that when they do fail, no significant data loss occurs. An unrelated scenario with the same effect as hardware failure is storage devices (or sometimes entire computers) being stolen. This is why we have backups and RAIDs.

The basic idea behind these technologies is that hardware failure events are relatively rare and independent of one another, so if we keep data duplicated in multiple places, we dramatically reduce the odds of all copies disappearing at once. Many techniques exist, each with its own set of pros and cons:

  • RAIDs use a number of identical drives to keep multiple copies of the data. They are the only technical solution that guarantees near-perfect synchronization between all data copies, but they also take up room, make noise (when built out of hard drives), and are vulnerable to simultaneous failure of all the disks involved from e.g. robbery or power surges
  • Backups on multiple external drives, stored in different places and never plugged in at the same time, are more robust, but they require manual user action and thus can never be performed as frequently as the other options
  • Online backup to professional data centers is more convenient, and can be as secure as local backup if data is encrypted before being sent there, but this solution tends to be pretty expensive per gigabyte and to put a lot of strain on the Internet connection backing it

On this front, it is currently impossible to devise a reliable long-term data protection plan without some direct user action. The role of the OS is then to encourage and facilitate best practices from the user in this area, by making it as easy as possible to do the right thing, and as hard as possible to do the wrong thing.

Resilience to user mistakes

A different class of data loss occurs when users modify a piece of data incorrectly, for example by deleting a file that should have stayed there, or by changing a file before realizing that it was better as it was. This is why all well-designed software has an “undo” command. When it comes to file management, an operating system cannot prevent these events, because it cannot tell right from wrong in user actions, but it should provide a way to cancel and roll back almost any user action that proves unwise in hindsight, without relying on applications to do the right thing instead.

Of course, naively porting the “undo” metaphor of live data manipulation onto file manipulation would be inappropriate. Storing every single version of a file, and every deleted file, forever would cost prohibitive amounts of storage space, and for users, searching through all that untagged data would amount to finding a needle in a haystack. For this reason, a coarser-grained and better-documented form of file versioning would be desirable. And developers have long had the right tool for this at hand, in the form of version control software like SVN or git.

Apple tried, again, to integrate such coarse-grained file versioning into OS X back in Lion. Unfortunately, they were again held back by the legacy of their existing user base, who expected files to keep working the way they always had. For this reason, they stripped out version descriptions, which are very important for bookkeeping, and explicit version creation by users, which ensures that the versions being kept actually matter to users rather than being unfinished business. And in the end, it was still mostly a failure, because it was too big a leap forward for a legacy OS having to cope with legacy applications.

Instead, I believe a proper OS-integrated file versioning system can only find its way into a radically new OS design of which users have limited UI expectations. One way it could work would be to hijack the way users have been conditioned for ages to continuously hit the save button and its Ctrl + S keyboard shortcut. Such commands are unnecessary in a modern OS that has software commit in-RAM data to disk transparently, but they could be repurposed into “version creation” commands, where the software politely asks users whether they want to create a file version, and for a short description of that version. This would provide a nice way for users to discover the feature and get into the habit of using it, although some usability testing would be needed to ensure that the unexpected popup is not too annoying.

File versioning software usually has more features than a mere timeline of named file versions. For example, it typically also integrates features for multiple users collaborating on a single file, or for “branching” a file into multiple versions existing in parallel. For now, I am not convinced that these features are useful as a core OS file management metaphor, and I think domain-specific software like git would do a better job at introducing them in a way that is appropriate for the application at hand, since the proper way to merge files from different sources is a bit of a domain-specific problem. But I would gladly accept any counter-argument.

Securely loading data

At this stage, saving data is sorted out. We can be fairly confident that data can be stored on mass storage media, modified in a transparent way by software, and that the most typical forms of software and hardware failure, and file mishandling by users, are taken care of. That was, quite frankly, the hardest part.

The next file management security question, then, is how software should access files when loading data from them. Now, don’t get me wrong, I’m pretty confident that applications can manage their file formats on their own, but the job of an OS here is to ensure that no program gets access to a file that it is not supposed to access, nor accesses it in the wrong way. This is the purpose of file access controls, of which the Read/Write/eXecute bits of UNIX filesystems are perhaps the best-known example.
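
As a small refresher, here is a hedged POSIX sketch of those classic permission bits in action: it reads a file’s current mode with stat() and then strips everything but the owner bits with chmod(). The function name is mine, purely for illustration.

```c
/* Small POSIX sketch of the classic UNIX access-control bits: read the
 * current owner/group/other permissions of a file and strip every bit
 * that would let other users read or modify it. */
#include <stdio.h>
#include <sys/stat.h>

int restrict_to_owner(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return -1; }

    printf("%s: mode %03o (owner/group/other)\n",
           path, (unsigned int)(st.st_mode & 0777));

    /* Keep only the owner bits: at most rwx------. */
    mode_t owner_only = st.st_mode & S_IRWXU;
    if (chmod(path, owner_only) != 0) { perror("chmod"); return -1; }
    return 0;
}
```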

Encryption

The first security concern that arises when data is stored on a mass storage device, especially when that device is small and external to the computer, is handling the scenario of someone stealing the device. In this scenario, one cannot guarantee that the files will only be accessed using a well-behaved operating system, so any form of OS-enforced file access control cannot be trusted. The only form of data protection that works in this scenario is encryption.

File encryption comes in many shapes, because designing an encryption system comes with a number of engineering tradeoffs. The first one of these is granularity: should encryption be applied per file, or across an entire drive?

Operating at file granularity means that the encryption system only protects the contents of files. Attackers still have access to metadata such as file names, sizes, locations, or access timestamps, which may or may not matter. On the plus side, per-file encryption means that a separate encryption key can be used for every file, so a breach of security in one file does not affect the others, and secure file deletion can be implemented as a simple matter of deleting an encryption key. That specific perk of file-level encryption will be discussed in more detail later.

Whole-drive encryption offers a more complete form of secrecy. When it is used, an attacker can only tell that there is an encrypted region of a certain size on a storage drive. And an area of research called deniable encryption aims at hiding even that. However, since a single encryption key protects the entire drive, if that key is compromised, the entire drive is compromised. And whenever encryption keys need to be replaced, as happens for example when a cipher becomes obsolete like DES did, the entire drive needs to be rewritten in a single pass, which is a slow, annoying (because the drive is inaccessible) and very dangerous process in which a power failure at the wrong time can easily lead to massive data loss.

In theory, one could combine the benefits of per-file and whole-drive encryption by encrypting each file separately, and then encrypting the filesystem’s data structures using a different encryption key. In practice, I am not aware of any real-world file encryption software that does it this way. This is most likely because it is more difficult to get right than the other two options, since there is no nicely layered abstraction where encryption is present at one specific level of the data storage infrastructure and absent from the others.

Another concern regarding encryption is that it should sometimes be authenticated. Authenticated encryption allows one to detect outsider tampering with encrypted files, and thus to ensure their integrity. It is usually done by adding a message authentication code (or, less commonly, a digital signature) to the encrypted file. Such authentication offers detection of spontaneous data corruption as a free bonus, but the cost of computing checksums over large files can get prohibitive, which is why the integration of this feature should be carefully considered. Then again, encryption of large files is, in general, always a bit problematic.
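
To give an idea of where that cost comes from, here is a minimal sketch that streams a whole file through a hash. It assumes OpenSSL’s libcrypto, which is my choice rather than anything the text prescribes, and a real authenticated-encryption design would use a keyed MAC or an AEAD tag rather than a bare digest, but the cost scales the same way: every byte must be read and processed.

```c
/* Minimal sketch of what integrity checking costs: stream a whole file
 * through a hash.  Uses OpenSSL's libcrypto (my assumption, not the
 * article's); a keyed MAC or AEAD tag would replace the bare digest in a
 * real design, but the I/O and CPU cost grows with file size either way. */
#include <openssl/evp.h>
#include <stdio.h>

int sha256_file(const char *path, unsigned char out[EVP_MAX_MD_SIZE],
                unsigned int *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (!ctx) { fclose(f); return -1; }
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);        /* cost grows with file size */

    EVP_DigestFinal_ex(ctx, out, out_len);
    EVP_MD_CTX_free(ctx);
    fclose(f);
    return 0;
}
```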

Finally, encryption brings with it the issue of key storage. Sometimes, the user can trust certain storage drives to be safe from outsider access, for example if those drives are stored inside a computer that never leaves their home, and they consider their home safe from burglary. In this case, those drives can be used to store encryption keys. For more sensitive data, however, or when data needs to be accessed on the go, such a scheme is insufficient. In this case, the encryption keys must themselves be encrypted, with a key that is ultimately protected by secret information that the user trusts to be safe enough, such as a sufficiently long password, a hardware authentication token that never leaves the user’s pocket for more than a couple of seconds, or a combination of both.

A short history of file access control

When the file abstraction was originally introduced, in the days of CP/M and the Apple II, it was considered appropriate to let every program under the control of every user write anywhere on every storage medium, and the only precaution taken was to make entire floppy disks read-only, using a small and unwieldy mechanical switch, when one did not need to write to them anymore.

Then multi-user machines got files too, students started to write malware as pranks, and people in the military and in system administration departments argued that not every user should be able to do everything to every file. Thus, the then-growing market of multi-user operating systems started to integrate the concept of an administrator-defined policy on what every user should and shouldn’t be able to do, and file access control appeared in the form of file ownership by a user and a group, and read, write and execute permissions for the file’s owner, its group, and the rest of the world.

This model, and its more flexible ACL variant, has remained at the core of every modern operating system, as new operating systems were built on top of old ones and application compatibility had to be maintained. But it makes little sense anymore. There is no central administrator tyrant on personal computers, and most of these computers have only a single user, or a couple of them. File ownership has largely lost its meaning, much like its “drive ownership” cousin used on legacy hard drive filesystems where per-file ownership information cannot be written. In short, the basic metaphor that we use today for file access control has essentially grown inadequate on personal computers.

Meanwhile, the threat posed by malware has only grown. With the advent of the World Wide Web, it has become extremely common to install plenty of software on a machine, from untrusted sources, for purposes ranging from work to entertainment. No validation process, from full source code audit to automated execution of binaries in a VM sandbox checking for “suspicious behaviour” prior to release, can deal with the sheer amount of software that is released daily. This means that anyone who claims to know with absolute certainty whether software is safe or malicious is basically lying in an attempt to make money off a false sense of security. The truth is that any software a user runs potentially contains malware, and the operating system has to be designed to deal with this unfortunate fact as well as possible.

File access control for the 21st century

Since user files are a common target for malware, CryptoLocker being a recent example, it is clear that the idea that software should have permission to read, modify, or delete every user file without explicit user approval has to go. Software file access should instead operate on a whitelisting basis, where software has intrinsic access to only a handful of files, such as its own configuration files, and needs to explicitly request access to every other file through trusted operating system file picker primitives.

But processes delegate work, and so, once a process has been given the right to access a file in a certain way, it should be able to temporarily pass this permission on to other software doing work on its behalf. This can be done by expressing the file access permission granted to software in the form of an unforgeable and revocable token of authority, called a capability, that may be passed around between programs for the purpose of performing some task. Of course, once the work has been carried out, the software which started it should discard the capability and have the operating system invalidate it, so that no one keeps any unnecessary privilege around.
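
To illustrate the idea, here is a toy sketch of what a capability table might look like. Every name in it is hypothetical, and a real implementation would live inside the kernel and hand processes opaque handles rather than letting them touch the table directly.

```c
/* Toy, purely illustrative sketch of a capability table as an OS might
 * keep it.  All names are hypothetical; a real implementation lives in
 * the kernel and hands out opaque, unforgeable handles. */
#include <stdbool.h>
#include <stdint.h>

enum { CAP_READ = 1 << 0, CAP_WRITE = 1 << 1 };

typedef struct {
    uint64_t file_id;   /* which file the capability refers to          */
    uint32_t rights;    /* CAP_READ | CAP_WRITE | ...                   */
    bool     revoked;   /* the OS can invalidate it at any time         */
} capability_t;

#define MAX_CAPS 256
static capability_t cap_table[MAX_CAPS];
static int cap_count;

/* Granted when the user picks a file in the trusted OS file picker. */
int cap_grant(uint64_t file_id, uint32_t rights)
{
    if (cap_count >= MAX_CAPS) return -1;
    cap_table[cap_count] = (capability_t){ file_id, rights, false };
    return cap_count++;            /* the handle passed around by software */
}

/* Checked by the OS on every file operation done through the handle. */
bool cap_allows(int handle, uint32_t wanted)
{
    if (handle < 0 || handle >= cap_count) return false;
    const capability_t *c = &cap_table[handle];
    return !c->revoked && (c->rights & wanted) == wanted;
}

/* Called once the delegated task is finished. */
void cap_revoke(int handle)
{
    if (handle >= 0 && handle < cap_count)
        cap_table[handle].revoked = true;
}
```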

I will discuss capability-based security in more detail in a later article, but one nice thing about capabilities is that they can be saved and reloaded when processes are restarted. Another is that, as a whitelist, they are best stored alongside programs rather than alongside the files they grant access to. This should drastically ease the process of sharing storage drives between multiple operating systems with differing file access control policies, a process which is quite a headache today, even though hardware innovations like USB pen drives have made it a common usage scenario.

Securely copying data

Once files can be saved and loaded by software, computer users will often want to move them around, for purposes such as organizing them, sharing a copy with a coworker/friend/lover, or making backups.

Moving a file within the boundaries of a single storage drive is actually a purely abstract operation. Even though some mobile application designers like to think of today’s hierarchical directory structures as a reflection of where data is physically stored on a disk drive, and thus as an implementation detail that should be hidden, the file organization presented by the operating system’s file explorer is really just a bunch of metadata stored alongside files for the convenience of users. “Moving” a file inside a storage drive just modifies the metadata that the OS keeps about it, which is why it is a very fast operation on most operating systems. And from a security point of view, it is also pretty much nothing new.
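
A small POSIX illustration of this point: within one filesystem, rename() moves a file by rewriting metadata only, which is why it completes instantly regardless of file size, while across drives it fails with EXDEV and an actual copy becomes necessary.

```c
/* Small POSIX illustration: "moving" a file inside one filesystem is a
 * pure metadata operation done with rename(); across filesystems rename()
 * refuses with EXDEV and a real copy-then-delete is needed. */
#include <errno.h>
#include <stdio.h>

int move_file(const char *from, const char *to)
{
    if (rename(from, to) == 0)
        return 0;                      /* instant, regardless of file size */

    if (errno == EXDEV) {
        fprintf(stderr, "%s and %s are on different drives: "
                        "a real copy is required\n", from, to);
        /* ... copy contents and selected metadata, then unlink(from) ...  */
    }
    return -1;
}
```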

On the other hand, copying data from one drive to another is a different story.

When a file is copied to a new storage drive, what should be copied? Contents? Metadata? Some metadata, but not the rest? Whenever the purpose of making a copy is to share data with someone else, anything that is copied to the external drive should be considered shared with that person, and that means some information security reflection is in order. Unfortunately, this is a form of caution that most of us are not well trained for.

Sometimes, this lack of caution is beneficial, as when crackers get caught because they left personally identifying information in their malware executables. Most of the time, the recipient of data simply does not notice the metadata that came along with it. But at other times, disclosing metadata to someone else can get somewhat problematic, all the more so as the amount of metadata stored on the filesystem grows.

Has a boss ever joked about your file access timestamps as you handed a file to them? That is a simple case of metadata-driven information disclosure. Other cases include, for example, knowing with whom you were talking on the phone, when, and for how long, without knowing the contents of the conversation. Geolocation metadata, which has grown pretty popular in today’s mobile operating systems and digital cameras, has also proven very prone to abuse. In short, there is a growing need for software that helps users manage the metadata that is stored about them and avoid transmitting it to the wrong person, and operating system software is no exception.

A military UI metaphor that could be used here is that of security domains. External storage drives could be considered trusted or untrusted by the operating system, following a user prompt the first time they are plugged in, and if they are untrusted, the operating system would transparently discard some sensitive metadata when files are copied to them. The problem is that this assumes that the operating system can uniquely identify external storage drives, which is nontrivial and requires some fact-checking. Otherwise, an attacker could just copy the authorization information from one drive to another and lead the OS to mistakenly believe that a drive is authorized for the storage of sensitive information when it isn’t. But if this precondition is verified, I think that’s a track worth exploring.
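
As a rough sketch of what “transparently discarding metadata” could mean in practice, the function below (illustrative only, with a name of my own invention) copies nothing but the bytes of the file: the destination gets fresh timestamps, default permissions, and none of the extended attributes where tags, authorship or geolocation data may be hiding. A production version would also need error recovery and an allow-list of metadata that is safe to keep.

```c
/* Simplified sketch of a "metadata-stripping" copy to an untrusted drive:
 * only the bytes of the file are transferred; timestamps, permissions and
 * extended attributes are deliberately left behind. */
#include <stdio.h>

int copy_contents_only(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (!in) return -1;
    FILE *out = fopen(dst, "wb");
    if (!out) { fclose(in); return -1; }

    char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            fclose(in); fclose(out);
            return -1;
        }
    }
    fclose(in);
    return fclose(out);                /* data only; metadata stays behind */
}
```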

Securely deleting data

Finally, after data has been saved, loaded, edited, saved again, passed around to supervisors and friends, and has overall lived a long and healthy life, there always comes a point where a given file is not needed anymore, and the user wants to part with it by deleting it.

Handling accidental deletion

In a general sense, deletion means data loss, so operating systems which are serious about information security always have provisions against accidental file deletion. The two main UI metaphors for this today are deletion confirmations and trashing, neither of which is really optimal:

  • Modern trash implementations keep their contents forever and remain largely hidden from users who do not feel compelled to empty them regularly. This means that data gradually piles up there until the storage drive is full, and the user suddenly has to deal with massive amounts of trashed data at once. His reaction will usually be to delete everything, no matter how recent, defeating the point of trashing as a way to give oneself some time to change one’s mind.
  • Confirmation dialogs are an older metaphor that has made a comeback recently as mobile operating systems started a war on user-visible filesystems. Like any dialog that appears very often, they simply don’t work: the human brain handles such repetitive actions by hard-wiring them into muscle memory and not consciously thinking about them after a while.

In the real world, trash cans are kept manageable by their limited size, which forces users to empty them frequently. This is not optimal, due to the issue mentioned above of treating freshly trashed items the same way as items that were trashed long ago. Another problem with that design is that if a large item is thrown in the trash, it has to be emptied almost immediately, no matter what was inside before. But I think there is some inspiration to be taken there.

As I mentioned a long time ago, the optimal design from my point of view would be a trash that automatically deletes items based on a time criterion. After some time has elapsed, say one month or one year, the user can be considered to have forgotten about a file in the trash, and that file should be purged permanently in order to keep the trash at a finite, somewhat reasonable size. I have since successfully tested this metaphor on a data acquisition system that I built at work: in the graphs showing the live data stream for monitoring purposes, about one hour’s worth of data points is kept, and then the oldest ones are discarded. In this context, at least, I’m happy to say that the design has worked like a charm.
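
For the curious, here is a minimal sketch of such a time-based trash purge. It is simplified in that it does not recurse into directories and uses the modification time as a stand-in for the time at which an item was trashed.

```c
/* Minimal sketch of a time-based trash: anything that has sat in the
 * trash directory longer than a cutoff is purged.  Real code would
 * recurse into subdirectories and record the actual trashing time. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

void purge_trash(const char *trash_dir, double max_age_seconds)
{
    DIR *d = opendir(trash_dir);
    if (!d) return;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", trash_dir, e->d_name);

        struct stat st;
        if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
            continue;

        if (difftime(time(NULL), st.st_mtime) > max_age_seconds)
            unlink(path);              /* old enough: forget it for good */
    }
    closedir(d);
}
```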

Deleting versus discarding

Another security problem arises when users delete a file from a drive instead of merely trashing it: they expect that file to be actually gone, and the data it contains to be unrecoverable by simple means. However, destroying data by overwriting it with zeroes or random numbers is a lengthy process, and for this reason, many operating systems instead take the performance shortcut of merely destroying all metadata that points to the file in the filesystem’s data structures, leaving the file data as is on the disk until another program needs the space it occupies.
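
For completeness, here is a hedged sketch of what actually destroying a file’s contents would look like: overwrite everything, flush, then unlink. As the following paragraphs explain, this is slow and, on SSDs or copy-on-write filesystems, not even guaranteed to work, so it is only a best-effort measure against casual recovery tools.

```c
/* Hedged sketch of the overwrite-then-unlink approach.  Best effort only:
 * SSDs, journaling and copy-on-write filesystems may keep the old blocks
 * around anyway. */
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int shred_file(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;

    char zeros[1 << 16];
    memset(zeros, 0, sizeof zeros);

    off_t left = st.st_size;
    while (left > 0) {
        size_t chunk = left < (off_t)sizeof zeros ? (size_t)left : sizeof zeros;
        ssize_t w = write(fd, zeros, chunk);
        if (w <= 0) { close(fd); return -1; }
        left -= w;
    }

    fsync(fd);                          /* force the overwrite out of the cache */
    close(fd);
    return unlink(path);                /* only now drop the name */
}
```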

From a security point of view, this metadata-only shortcut is obviously the wrong way to go. Basic data recovery software available to everyone and his dog can recover data from such “deletion”. A more sensible approach, as for large disk writes, would be to make file deletion asynchronous: schedule the deletion now, and finish it later. But this approach has the same problems as disk caching: one has to ensure that the actual deletion is performed at some point, and for physical storage drives, which can be unplugged by the user or brutally unmounted by a power loss, there is no way to guarantee that this will happen.

If this was not bad enough, another issue arises when one uses SSDs based on flash memory. Due to the way these work on the inside, overwriting a block of data is a very expensive operation for them, and they tend to work around this by garbage-collecting the overwritten block and starting to write to a new one instead. Obviously, this is again a problem from a security point of view: a garbage-collected memory block is not truly gone, it is just kept hidden somewhere until someone uses electronics testing or drive firmware bugs to read it anyway.

A workaround for both of these problems exists when:

  1. The data that is saved on the drive is encrypted
  2. Encryption is carried out on a per-file basis
  3. The per-file encryption keys can be stored on a separate, trusted storage drive

If all of these conditions are met, deleting a file, no matter how big, is as easy as erasing the file encryption key on the trusted storage drive. In that case, an attacker trying to recover the file will merely end up with a bunch of encrypted junk, at little to no deletion cost. However, criterion 3, which is critical to the security of this approach, is pretty hard to meet in practice. If it cannot be met, the next best approach is to use per-file encryption with keys stored on the drive, delete files by overwriting the block containing the encryption key, and fall back to full drive erasure or destruction where security against skilled data recovery professionals is desired.
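
Here is a small sketch of that “deletion by key destruction” idea, assuming the hypothetical layout described above: the bulky ciphertext sits on the slow drive, while the small per-file key lives in its own file on a trusted drive. Destroying the key is cheap and instantaneous, no matter how large the ciphertext is; the same SSD caveats apply to the key file itself, which is precisely why condition 3 matters.

```c
/* Sketch of "deletion by key destruction", assuming the hypothetical
 * layout from the text: encrypted contents on the slow drive, the small
 * per-file key in its own file on a trusted drive. */
#include <fcntl.h>
#include <unistd.h>

#define KEY_LEN 32

int crypto_erase(const char *key_path, const char *ciphertext_path)
{
    /* Overwrite the stored key with random bytes before dropping it. */
    int rnd = open("/dev/urandom", O_RDONLY);
    int key = open(key_path, O_WRONLY);
    if (rnd < 0 || key < 0) return -1;

    unsigned char junk[KEY_LEN];
    if (read(rnd, junk, sizeof junk) != (ssize_t)sizeof junk) return -1;
    if (write(key, junk, sizeof junk) != (ssize_t)sizeof junk) return -1;
    fsync(key);                     /* make sure the old key is really gone */
    close(key);
    close(rnd);
    unlink(key_path);

    /* The bulk data can now be dropped lazily: without the key it is junk. */
    return unlink(ciphertext_path);
}
```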

On this front, it’s worth pointing out that wiping entire drives can be done a lot more securely than wiping files, by individually erasing all of the drive’s sectors, and then some to account for storage space over-provisioning. But this approach is extremely slow in practice, and thus cannot be recommended except for small drives or extremely sensitive information.

Finally, one should note that since deletion is essentially voluntary data loss, there will usually be a compromise between a system’s resiliency to data loss and the security of its deletion procedure. Most software procedures designed to offer security against data loss, such as versioned backups, will end up keeping around copies of data that should have been gone, and most software procedures that try to destroy data permanently will end up also destroying copies of said data that are kept around for reliability reasons, such as named file versions. There is an unavoidable tradeoff there, which is not a problem in and of itself, but needs to be understood by both software developers and users before a data handling policy that is adequate for one user’s specific needs can emerge.

Conclusion

With this article, I hope I have convinced you that although files are truly a beautiful abstraction from a software engineering point of view, they are anything but simple to handle from a security point of view. Avoiding file management-related information security incidents requires paying attention to every part of the file lifecycle, from creation to eventual deletion, and many things can go wrong in the most seemingly simple file operations: saving data to files, loading data from them (or, more generally, accessing them), copying them across drives, and eventually deleting them.

To summarize, since this was a bit of a long post, the specific information security issues that I have discussed are:

  • The reliability issues of disk caching
  • The need to be resilient to software, hardware and user failure alike, and various strategies that can be used to this end
  • Physical storage drive security through encryption (and its variants and tradeoffs)
  • The need for new file access control policies on personal computers in order to meet 21st century security challenges
  • The problem of metadata disclosure when files are copied around
  • The conflicting needs for protection against accidental file deletion and actual file removal upon voluntary deletion
