The importance of error management

After spending some time considering good practices for the handling and organization of data, let’s move on to the related subject of data loss and software reliability.

Most of the time, data loss occurs due to errors, a broad term which in software engineering jargon can designate two very different things:

  1. A hardware, software or human issue requiring attention, which breaks the normal flow of computer programs.
  2. A kind of user mistake, where software does what the user asked for, but not what the user meant.

In this post, we will discuss what happens when errors are not managed properly, and software strategies for managing them by identifying, preventing, detecting, handling, and possibly reporting them.

Why do we need error management

Unfortunately, error management is one of those realms of engineering where it is much easier to have a laugh at people doing the wrong thing than it is to explain how to do the right thing. So we will begin by explaining how NOT to do error management, and to do this, we’ll first have a look at hardware errors.

Contrary to widespread programmer belief, although computer hardware is not without its quirks, it is overall pretty reliable. After all, trillions of times a day, typical computer hardware correctly and quickly performs its designated tasks over and over again. And, along the way, silently handles a number of basic issues that software should not have to care about, such as cache misses and minor forms of memory corruption.

Yet at times, hardware encounters problems which it cannot solve on its own. Examples include overheating, communication errors caused by aging wires, or the number of bad sectors on a storage device growing beyond what is fixable by relocation. These situations require at least OS intervention, and possibly user attention too.

In that case, the hardware notifies OS software of the problem through a standard notification channel, such as an invalid function return code, a status flag, or an interrupt (typically a combination of these). If the OS software does not correctly catch this error signal and ensure it is acted upon, a number of Very Bad Things may occur, including:

  • Massive data loss
  • Hardware damage (or, in extreme cases, user damage)
  • Problems randomly crawling up the software stack and causing a variety of unpredictable, seemingly unrelated, glitches and crashes

Let’s take a look at this in practice, using a couple of unfortunate hardware-related events that happened to me recently, both of which were worsened by improper error management in some of the software I use.

How not to do error management

The silent SSD death

Recently, I’ve had an SSD die on me after only six months of useful life. That sucks. But what sucks most, in my view, is the way I’ve had to endure weeks of random computer freezes and random data losses before figuring out what was going on.

It all started with ATA errors. Basically, the drive’s ATA controller, which is responsible for communication with the rest of the computer, started to randomly reject OS commands like read and write requests for no clear reason. From time to time, it just stopped listening to them for a while.

The Linux kernel running on my computer noticed what was going on, because its ATA requests got no reply. Unfortunately, that was not a problem it could fully handle on its own, because the cause might be physical, like a loose SATA cable or a dying drive. But like most OS kernels, its mission is to keep the computer running at all costs, so it did try its best. Its reaction was, overall, very sensible: attempt to reset the ATA link and reconnect to the drive, then log warning messages in dmesg so that I, as a user, could become aware of what was going on and act on it.

The problem is that, like most Linux users these days, I use the OS through a GUI, and Linux GUIs do not forward dmesg kernel messages to any visible location. So from my point of view, all I could see was the computer randomly freezing upon mass storage I/O, in a non-reproducible way, and then, after a while, growing symptoms of data loss, like warnings about unreadable files or the desktop configuration resetting itself.

That was worrying, and it puzzled me for a long while: I checked for suspicious process activity and found nothing. Only when I finally thought of checking the dmesg output could I promptly recover, by identifying the culprit, proving that the drive itself was faulty (rather than its connection to the motherboard), and removing it.

Lesson learned: error management transcends abstraction boundaries. If something happens that requires user attention, then there needs to be a way to notify the user, no matter how far down the software stack the problem occurred. Low-level problems may require high-level attention.

The full partition

Do you know what happens when one of your disk partitions becomes full and you attempt to write more to it? The write obviously fails, and the error then bubbles up the software stack, from one library to another, until it reaches the program which actually requested the disk write.

In the case of C programs based on the standard C library, which is essentially the norm on UNIX-based operating systems, the error is reported to the program through an invalid return code from the function used to perform the write. fprintf(), for example, which normally returns the number of characters written to the file, returns a negative number instead. Programs may then optionally check the errno global variable to learn more about the error that occurred.
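
To make this concrete, here is a minimal C++ sketch of what checking those return codes looks like (the file name is made up). Note that even fclose() needs checking, since it flushes buffered data and can therefore hit the same “no space left” condition:

    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    int main() {
        // Hypothetical file on the full partition discussed above.
        std::FILE *f = std::fopen("output.txt", "w");
        if (f == nullptr) {
            std::fprintf(stderr, "open failed: %s\n", std::strerror(errno));
            return 1;
        }
        // fprintf() returns the number of characters written, or a negative
        // value on failure (for instance when the partition is full).
        if (std::fprintf(f, "some important data\n") < 0) {
            std::fprintf(stderr, "write failed: %s\n", std::strerror(errno));
            std::fclose(f);
            return 1;
        }
        // fclose() flushes buffered data, so it must be checked as well.
        if (std::fclose(f) != 0) {
            std::fprintf(stderr, "close failed: %s\n", std::strerror(errno));
            return 1;
        }
        return 0;
    }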

However, a problem with special function return codes is that software has to actively check for them. Novice, distracted or careless programmers will typically forget to do so, which results in the program erroneously continuing its operation as if the data had been properly written to disk, left in an inconsistent state without knowing it.

As I have experienced, practical symptoms of this situation include data loss, random program crashes with very cryptic (seemingly unrelated) error messages, and strange system slowdowns. This time, dmesg was useless as a diagnosis tool, and I was once again at a loss as to what was going on. The clue only came when I used a better-written program, GIMP, which finally displayed something that looked like a “no disk space left” error. Only then could I see what was going on and recover by freeing up some disk space.

If you have ever run out of RAM and swap on Linux, you may have noticed that the symptoms are very similar. Things crash randomly, with very unclear error messages. That is because the root cause of the problem is the same: errors are correctly reported up to the libc level, then silently discarded by poorly written application programs or the third-party libraries they rely upon.

Here, the lesson learned is that error management is mandatory. Failure of a program to handle errors properly leads to incomprehensible behavior and disaster. This is one reason why exceptions have replaced invalid return values in most modern programming languages: when errors aren’t handled in an exception-based programming language, the faulty program crashes instantly with a reasonably clear error message, instead of acting drunk. This crash prevents the poorly written program from causing further harm to the system, though it can also have very undesirable consequences in itself, as exception detractors like to point out.
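
As an illustration rather than a prescription, here is a small C++ sketch of the same kind of write failing in an exception-based style (file name again hypothetical): if nothing catches the exception, the program terminates on the spot with a diagnostic instead of carrying on in an inconsistent state.

    #include <fstream>

    int main() {
        std::ofstream out;
        // Ask the stream to throw instead of silently setting error flags.
        out.exceptions(std::ofstream::failbit | std::ofstream::badbit);

        out.open("output.txt");   // hypothetical file name
        out << "some important data\n";
        out.flush();              // a full partition surfaces here as an exception

        // If any operation above fails and nothing catches the resulting
        // std::ios_base::failure, the program terminates immediately with a
        // diagnostic rather than pretending the data was written.
        return 0;
    }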

Again, exceptions or not, error management is not optional. And now that we have seen why it is necessary, let me try to propose a way to do it properly.

Error management guidelines

Identification

The first step towards error management is to identify the points in a program where errors may occur, and the kinds of errors that may occur, as accurately as possible.

This step must occur as early as possible during program design, because the response of a program to an error is an integral part of its design. Libraries must document which exceptions they are going to throw, end-user interfaces must feature interaction primitives for error reporting, and error recovery strategies need to be included in the feature plan of a software product.

Unfortunately, this is also probably the most difficult step, as it requires software designers to think beyond the scope of their specific area of application, with the whole computing system in mind. Again, errors do not stop at component boundaries, and may sometimes need to bubble up very high in the software stack before they can be handled appropriately.

Another error identification gotcha is that lower-level software components may, when updated, start throwing a wider variety of errors. Existing client software must, without being updated itself, correctly address these new errors that it knows nothing about, by providing a last-chance error handler that does the right thing, including not gobbling up errors which are destined for a component higher up in the software stack. This part may prove extremely challenging.
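
One common way to approach this, sketched below in C++ with entirely hypothetical error types, is a last-chance handler that deals with the errors it knows about and rethrows everything else instead of swallowing it:

    #include <iostream>
    #include <stdexcept>

    // Hypothetical error type that this layer knows how to deal with.
    struct RetryableIoError : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // Stand-in for a lower-level library call that, after an update, starts
    // throwing errors this layer has never heard of.
    void lower_level_call() {
        throw std::logic_error("new error type introduced by a library update");
    }

    void process_request() {
        try {
            lower_level_call();
        } catch (const RetryableIoError &e) {
            std::cerr << "recoverable I/O error, retrying: " << e.what() << '\n';
            // ... retry logic would go here ...
        } catch (...) {
            // Last-chance handler: log what little we know, then rethrow so
            // the error keeps bubbling up to a layer that may be able to
            // handle it, instead of being silently swallowed here.
            std::cerr << "unknown error, passing it up the stack\n";
            throw;
        }
    }

    int main() {
        try {
            process_request();
        } catch (const std::exception &e) {
            // In this toy example, the unknown error ends up handled here.
            std::cerr << "handled higher up: " << e.what() << '\n';
        }
        return 0;
    }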

Prevention

Once the ways errors can occur have been identified, and isolated in scenarios reminiscent of the threat models used in information security, the next step in error management is to try to prevent as many of them from occurring as possible, and, when they do occur, to reduce their impact on software operation to a minimum.

The most avoidable kind of error is the user mistake, because these can generally be addressed through better user interface design. Good software UIs go to great lengths to prevent users from shooting themselves in the foot, and an interface that is failure-proof without being annoying is one of the hallmarks of great software UX.

At the other end of the spectrum, hardware errors are often not avoidable. For example, it is very hard to prevent a CPU from overheating in software if the computer’s thermal design is bad to begin with, and it is impossible to prevent power failures from affecting software operation on computers that do not have batteries.

Reducing error impact, for its part, is usually achieved by reducing the granularity of software processes. Take, for example, a data acquisition system that continuously measures some physical property over a period of one year. If acquired data is committed to disk once an hour, the impact of a power failure after 364 days will obviously be much smaller than if the system waits for the entire year to go by before writing all acquired data to nonvolatile storage at once.
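
Here is a rough C++ sketch of that idea (the sensor function and file name are placeholders): measurements are committed at a fixed interval, so at most one interval’s worth of data is lost if the power goes out.

    #include <chrono>
    #include <fstream>
    #include <thread>
    #include <vector>

    // Placeholder for a real measurement; a real system would talk to hardware.
    double read_sensor() { return 42.0; }

    int main() {
        using namespace std::chrono;
        std::vector<double> pending;              // measurements not yet on disk

        const auto commit_interval = hours(1);
        auto last_commit = steady_clock::now();

        for (;;) {                                // acquisition runs indefinitely
            pending.push_back(read_sensor());
            std::this_thread::sleep_for(seconds(1));

            if (steady_clock::now() - last_commit >= commit_interval) {
                std::ofstream out("measurements.log", std::ios::app);
                for (double v : pending)
                    out << v << '\n';
                out.flush();
                if (out)                          // only drop the buffer if the write succeeded
                    pending.clear();
                last_commit = steady_clock::now();
            }
        }
    }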

Detection

Detecting errors when they occur is critical to handling them in software. For the most part, the only cases where errors cannot be detected automatically are when the hardware is not equipped with the required sensors, or when they originate from user mistakes.

If errors have been properly identified, detecting them is usually just a matter of putting exception handlers and return value checks in the right places in the code, and supplementing these with a proper program-wide error handling infrastructure. Note, again, that exception-based error reporting is safer here, in the sense that undetected errors cause program termination rather than improper program execution.
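
As a minimal sketch of such program-wide infrastructure, assuming a C++ program for which logging to stderr is good enough, a single top-level handler can turn any detected but otherwise unhandled error into a clear diagnostic:

    #include <cstdlib>
    #include <exception>
    #include <iostream>
    #include <stdexcept>

    // Hypothetical application entry point; anything below it may throw.
    void run_application() {
        throw std::runtime_error("disk write failed: no space left on device");
    }

    int main() {
        // Catch whatever slips past every other handler in one central place,
        // so a detected error becomes a clear diagnostic rather than silent
        // misbehavior further down the line.
        try {
            run_application();
        } catch (const std::exception &e) {
            std::cerr << "fatal error: " << e.what() << '\n';
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }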

Errors which cannot be detected or recovered from in software, for their part, are usually best handled by providing users with a way to go back in time upon noticing that an error occurred. For example, silent disk failure is best handled through backups, and user mistakes are best handled through robust multiple undo mechanisms or file versioning, as opposed to incessant confirmation dialogs that end up being skimmed through or overridden with --force CLI switches.

To further add to my point about confirmation dialogs, note that in some cases, such as the abrupt removal of a USB pen drive, one cannot ask the user for confirmation before they make a mistake, whereas one can almost always design software systems in an undo-friendly fashion.

Handling

It is very hard to provide general guidelines for error handling, because the appropriate way to handle an error varies enormously depending on the kind of error that occurred and the means available to address it in software.

Still, error handling in reliable software usually revolves around returning the software to a consistent state in a fashion that loses as little state information as possible, much like journaled filesystems return to their latest consistent state after a power failure, minimizing data loss.
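
A classic embodiment of this idea, sketched below in C++ with a hypothetical naming scheme, is the write-to-temporary-then-rename pattern: a failure at any point leaves the previous, consistent version of the file untouched.

    #include <cstdio>
    #include <filesystem>
    #include <fstream>
    #include <string>
    #include <system_error>

    // Write the new version of a document to a temporary file, and only
    // replace the old version once the write has fully succeeded. A crash or
    // a full disk mid-write leaves the previous, consistent file untouched.
    bool save_document(const std::string &path, const std::string &contents) {
        const std::string tmp = path + ".tmp";    // hypothetical naming scheme

        std::ofstream out(tmp, std::ios::trunc);
        out << contents;
        out.flush();
        if (!out) {
            std::remove(tmp.c_str());             // clean up the partial file
            return false;                         // the old version is still intact
        }
        out.close();

        std::error_code ec;
        std::filesystem::rename(tmp, path, ec);   // atomic replacement on POSIX filesystems
        return !ec;
    }

    int main() {
        return save_document("notes.txt", "hello\n") ? 0 : 1;
    }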

Sometimes, though, software cannot handle errors on its own, and must call other software up and down the OS stack for help. Other times, a software operation may return a valid return code without having successfully completed, depending on its design. Sometimes a part of the software system must restart itself; other times everything can keep running just fine in a degraded functionality mode, as when CPUs are throttled as a fix for overheating. So again, the proper approach to error handling really depends on the kind of error that occurred.

Reporting

Reporting errors to machine users is not always necessary. For instance, any software project which directly interacts with computer firmware, especially if it resides on a GPU, was written by Apple engineers, or is in any way related to ACPI, has learned to silently work around firmware flaws for the sake of user sanity. But sometimes, problems must be brought to the attention of end users, administrators, or engineers.

When error reporting is necessary, it must be designed in a very thoughtful manner, for a number of reasons:

  • Not all errors are born equal. Some simply require users to take notice (e.g. “Wireless connection lost”), whereas others require user acknowledgement and action (e.g. “Battery level critical”). Putting everything on the same level will just lead users to get annoyed and dismiss important error messages (a small sketch of this distinction follows this list).
  • Error messages must be tailored to the kind of user they target. They must be both understandable to the user receiving them, and harmless to every user involved.
  • In particular, error messages should not leak to end users information that could be used to exploit software security flaws.
  • Nor should they leak privacy-sensitive information to developers or sysadmins.
  • Users may not be there to process an error message immediately, and error reporting mechanisms must be designed to account for that.
  • Logging is a critical system diagnosis tool, and must not be left as an afterthought.
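
To make the first two points concrete, here is a toy C++ sketch (severity names and messages are invented) that separates what gets logged for administrators from what is shown to the end user, and only interrupts the user when action is actually needed:

    #include <iostream>
    #include <string>

    // Hypothetical severity scale: not all errors deserve the same treatment.
    enum class Severity { Info, Warning, Critical };

    void report(Severity sev, const std::string &user_msg,
                const std::string &log_msg) {
        // Always record the detailed message for administrators and developers.
        std::clog << "log: " << log_msg << '\n';

        // Only interrupt the user for problems that really need attention,
        // and only with a message written for them (no stack traces, no paths).
        switch (sev) {
        case Severity::Info:
            break;                                // a passive notification would do
        case Severity::Warning:
            std::cout << "Notice: " << user_msg << '\n';
            break;
        case Severity::Critical:
            std::cout << "Action required: " << user_msg << '\n';
            break;
        }
    }

    int main() {
        report(Severity::Warning, "Wireless connection lost",
               "wlan0: deauthenticated from access point");
        report(Severity::Critical, "Battery level critical, plug in a charger",
               "battery at 3%, shutdown imminent");
        return 0;
    }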

Conclusions

Even the most reliable computer system cannot always operate as expected. Because of hardware problems, invalid requests, or user mistakes, software often needs to stray from its regular path of operation. This is called an error.

As was discussed in this article, failure to manage errors appropriately leads to unfortunate consequences, which can range from minor visual glitches all the way to hardware getting damaged and people getting injured or killed. For this reason, error management is a critical and mandatory part of good software design.

Here, I have proposed an approach to error management based on five core pillars: identification, prevention, detection, handling, and reporting. I believe that such an approach covers the error management needs of all practical software without forgetting any important part, and I have provided guidelines for implementing each of these pillars.

And that will be all for today, so thank you for reading!
