Input sanitization is a well-accepted software design principle these days. Pretty much every developer worth his salt will agree that production-grade programs should not hang, crash, or do funky and undesirable things in response to user input, no matter how malicious the user or how malformed the input. But in low-level software development, where programs wield a lot of power, I would argue that software should also be very mindful of its output, and of the harm it may cause to the system it runs on. This post is a rough discussion of that.
Hardware is not idiot-proof
“But”, some will argue, “if hardware firmware itself was designed according to reasonable input sanitization principles, there’s no way the OS could cause permanent damage by sending valid commands to it. There’s no way hardware could still exhibit the kind of crippling vulnerability that allowed one to fry input peripherals by treating them as output back in the 80s!”
And I wish that were true. I wish the drastic increase in chipset processing power and firmware complexity that occurred during the last few decades had come with equally drastic improvements to the peace of mind of low-level developers, letting them make mistakes without it being too big a deal. However, that is just an assumption, and as a matter of fact, rarely in software history has an increase in complexity resulted in a corresponding increase in reliability. One could rather make the case that due to ever-tighter development schedules, developer laziness, and codebase bitrot, the reverse tends to happen.
Moreover, hardware fundamentally cannot be made mistake-tolerant the way user-facing software has been, because by its nature, low-level software has to be able to do very dangerous things. Filling a disk drive with zeroes or random garbage? That’s just an alternate description of secure erasing. Disabling interrupts, allowing for full CPU lock-up on an infinite loop? That is necessary in order to avoid race conditions when processing one of said interrupts. Erasing the firmware of a device, turning it into a useless brick? If said firmware is upgradeable, then erasing the old firmware may be a perfectly valid software request in preparation for the writing of a new one.
And that’s just for ideal-world, pie-in-the-sky hardware, which is designed by perfectly smart people and does no stupid things. In reality, though, hardware people tend to be pretty bad at software to begin with, and then write their firmware using pretty bad development environments, which makes it worse. So what happens in practice? Firmware randomly writes to RAM that is supposed to be free (hello Apple EFI implementations). Devices appear or disappear depending on who the OS claims to be (when the computer doesn’t just commit suicide in the face of something that is not Windows, hello Samsung UEFI implementation). Peripherals on buses with external sockets like Firewire and Thunderbolt get full access to the machine’s RAM.
Would you trust firmware that does these kinds of things to sanitize its input properly? I wouldn’t. It doesn’t even work properly in response to valid input to begin with. Yet we low-level software developers have to support it anyway. And that means, as you can guess, that lots of caution should be exercised.
In practice: Mandrake 9.2 vs LG CD-ROM drives
Now, the detailed example I’ve chosen in order to illustrate this may seem a bit old. After all, no one uses either Mandrake Linux or CD-ROM drives these days. But I’ve picked this one because I find both the nature of the bug and the way the Linux community at large reacted to it very interesting.
Let’s start with the symptoms though. We are in the early 2000s, and Mandrake Linux, at the time the most popular Linux distribution for beginners, ships its latest and greatest release, 9.2. People try to install it. Their CD-ROM drives become totally unusable as a result. Closer examination reveals that their firmware has been entirely wiped. Quite the newbie-friendly experience, indeed.
Further investigation eventually uncovers the code responsible: a Linux kernel patch that Mandrake had merged in that release. The goal of the patch was to let the kernel discriminate between CD readers and burners, so as to avoid problems later on. It worked by sending an ATAPI command that only CD burners should implement, FLUSH_CACHE, the rationale being that this specific command should do nothing of consequence on a CD burner, and trigger a detectable error on CD readers.
So, essentially, the patch was relying on the error handling behavior of a device in response to a command that was not mandated for implementation.
Meanwhile, on its side, the LG firmware on the CD-ROM drive did something equally clever. Since it didn’t need to implement this specific ATAPI command, it had re-purposed it into a “firmware update” command, which would wipe the device firmware and wait for a new firmware image to be flashed on the drive.
Analysis of this incident usually focused the blame on LG, who violated the ATAPI standard by using a valid opcode for something totally different from its intended purpose, or on Mandrake for not testing the patch on every possible CD drive on Earth (“Screw practicality! You have tests!”). But although I wouldn’t dispute that either would have avoided this specific incident, I think that the way the kernel hacker who wrote this patch tried to have his way is also pretty jarring.
You are writing software that faces finicky hardware firmware. This stuff is highly unstable. It can explode just by you looking at it. And what does your software do? It pokes into it randomly. It relies on hardware reacting correctly to a command that’s not even supposed to be there. That’s basically no different than having a program execute data at an arbitrary RAM address as code, under the rationale that if there is no real code in there, it will eventually crash. Sure, but you don’t know how much harm it will do beforehand. And in my view, it’s not okay to try just because it happened to work on the specific hardware configuration that you tested it on.
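To make the alternative concrete, here is a minimal Python sketch of a driver-side gate that only ever sends commands from a mandated baseline, plus whatever the device itself reports through a mandated capability query. Everything in it is an assumption made for illustration: the `CommandGate` and `FakeReader` classes, the `GET_CAPABILITIES` command name, and the contents of `BASELINE_COMMANDS` are hypothetical and are not taken from the ATAPI spec or from the Linux kernel.

```python
# Hypothetical baseline: commands the spec mandates for *every* device.
# A real driver would take this list from the relevant standard.
BASELINE_COMMANDS = frozenset({"INQUIRY", "GET_CAPABILITIES", "READ"})


class UnsupportedCommand(Exception):
    """Raised when a command is neither mandated nor reported as supported."""


class CommandGate:
    """Forwards a command only if the device is known to define it."""

    def __init__(self, send):
        self._send = send  # callable that actually talks to the hardware
        self._allowed = set(BASELINE_COMMANDS)

    def probe(self):
        """Extend the allowlist using a mandated capability query only."""
        reported = self.issue("GET_CAPABILITIES")
        self._allowed |= set(reported)

    def issue(self, command):
        if command not in self._allowed:
            # Refuse in software; never find out what the firmware would do.
            raise UnsupportedCommand(f"{command} was not reported as supported")
        return self._send(command)


class FakeReader:
    """Simulated read-only drive whose firmware repurposes FLUSH_CACHE."""

    def __init__(self):
        self.bricked = False
        self.received = []

    def send(self, command):
        self.received.append(command)
        if command == "GET_CAPABILITIES":
            return []  # a plain reader reports no optional commands
        if command == "FLUSH_CACHE":
            self.bricked = True  # stands in for the firmware wipe
        return "OK"


drive = FakeReader()
gate = CommandGate(drive.send)
gate.probe()
try:
    gate.issue("FLUSH_CACHE")
except UnsupportedCommand as err:
    print("refused:", err)
print("commands that reached the drive:", drive.received)
print("bricked:", drive.bricked)
```

The point of the design is that the optional command never reaches the simulated drive at all, so the driver's correctness does not depend on how the firmware handles a command it was never required to understand.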
Proposals for better output sanitization
So how can we ensure that software does a better job at not harming the hardware underneath? I have a number of proposals to this end, taking inspiration from some programming practices of the mission-critical embedded world. They will probably not solve the full output sanitization problem on their own, but I think they are a step in the right direction, and I would encourage people smarter than me to pursue this line of thought further:
- Start by respecting the specs to the letter. What’s written in them must be assumed to be strictly exact, and what’s not written in them should not even be tested for. No matter how much existing hardware an unspecified behavior has been observed on, it is not guaranteed to hold on future hardware. When debugging buggy low-level software, make sure that it fully respects the specs before moving to the next step.
- If the specs are provably wrong, don’t just make a workaround and post a rant to your small group of fellow developers. Contact the hardware manufacturer who violated them and tell them about this issue. If the problem lies in the design of the spec itself, contact the group responsible for writing it. Don’t let problems pile up for future generations just because you found your own kludgy way around it.
- Ideally, if the means are available, test your software against the spec before testing it on real hardware. For sufficiently simple specs, the best option may be to write a simple emulator, or reuse an existing one. Make sure that the emulator you use is designed for low-level debugging, not just performance: undefined behavior should be made truly undefined through randomization, and when possible, accessing illegal values should lead to run-time warnings from the emulator.
- Automate the tests. Anything you want to test yourself, try to make an automated build environment check on its own too. Again, well-designed emulators can help greatly with this, since some tests may be too dangerous to be repeatedly carried out on real hardware.
- To reduce the impact of any remaining bugs, use modular programming practices. The more dangerous a software component, the more restricted its purpose and codebase size should be. Policy/mechanism separation can be used to ensure this. An example of how NOT to do it is to ask video drivers to implement the full OpenGL spec, the way things are currently done in the Linux world (though with the Gallium State Tracker Interface, it seems the Linux video folks have realized that something is wrong and set out to fix it).
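The emulator and test automation proposals can be sketched together in a few lines of Python. This is a toy under stated assumptions, not a real device model: the two-register device, the `EmulatedDevice` class, the `UndefinedBehaviorWarning` category, and the `driver_send_byte` routine are all hypothetical names invented for the example. Reads from undefined registers return random garbage and emit a warning, and the test harness turns that warning into a hard failure.

```python
import random
import warnings


class UndefinedBehaviorWarning(UserWarning):
    """Emitted when the driver does something the spec leaves undefined."""


class EmulatedDevice:
    """Toy emulator of a hypothetical device with two defined registers."""

    DEFINED_REGISTERS = {0x00: "STATUS", 0x01: "DATA"}

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._regs = {addr: 0 for addr in self.DEFINED_REGISTERS}

    def read(self, addr):
        if addr not in self.DEFINED_REGISTERS:
            warnings.warn(f"read of undefined register {addr:#04x}",
                          UndefinedBehaviorWarning, stacklevel=2)
            # Return garbage so no driver can come to rely on a fixed value.
            return self._rng.randrange(256)
        return self._regs[addr]

    def write(self, addr, value):
        if addr not in self.DEFINED_REGISTERS:
            warnings.warn(f"write to undefined register {addr:#04x}",
                          UndefinedBehaviorWarning, stacklevel=2)
            return
        if not 0 <= value <= 0xFF:
            warnings.warn(f"illegal value {value} written to {addr:#04x}",
                          UndefinedBehaviorWarning, stacklevel=2)
            value &= 0xFF
        self._regs[addr] = value


def driver_send_byte(dev, byte):
    """Hypothetical driver routine under test: write DATA, then read STATUS."""
    dev.write(0x01, byte)
    return dev.read(0x00)


def run_checked(fn, *args):
    """Run fn, failing if the emulator reports any undefined behavior."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", UndefinedBehaviorWarning)
        return fn(*args)


dev = EmulatedDevice(seed=1)
run_checked(driver_send_byte, dev, 0x42)  # stays within the spec
try:
    run_checked(dev.read, 0x7F)  # touches an undefined register
except UndefinedBehaviorWarning as err:
    print("caught:", err)
```

A build system could call `run_checked` on every driver entry point, so that a driver which wanders outside the spec fails the build long before it meets real hardware.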
Any further ideas?