No human being could understand the source code of an even mildly complex computer program if it were displayed before their eyes in a fully inlined fashion, used only unstructured data, and was stripped of every human-readable comment. For this reason, abstraction has always been a central concept in programming languages and practices.
Among popular programming abstractions, data encapsulation in particular has been heavily used in recent decades, in the wake of a persistent craze for the loosely defined concept of “object-oriented programming”. But is it such a good idea to decouple programs from the data they manipulate like that?
This article will discuss some issues of data encapsulation in the context of long-lived software like operating systems, and possible strategies to mitigate them.
From abstraction to data encapsulation
Abstraction is the art of mapping complex computer problems onto simple programming language constructs, so that one may use these constructs as a black box without caring about what happens inside. And when it comes to the procedural model of packing complex code inside of simpler functions and procedures, I think everyone will agree that it is a good thing, as compared to flat spaghetti code filled up with GOTOs or JMPs.
Many object-oriented programming languages, however, go one step further, and also attempt to hide program data inside of complex black boxes, called objects, that can only be accessed through a procedural interface. Through language mechanisms, programmers may ensure that the most straightforward way to directly access the data of an object is an illegal operation, and thus that only hackers will attempt to access the inner structure of their objects, while the rest of the world will behave nicely and solely access objects through the code they designed for it.
This mechanism is called data encapsulation (or information hiding, depending on who you ask). It was designed to serve a number of purposes, including trying to control the ways in which a piece of data may be accessed by software, and ensuring that modular code developers may change the inner data structures behind their objects without accidentally breaking code that uses them.
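To make this concrete, here is a minimal sketch in Python. The `Counter` class and its method names are hypothetical, chosen only for illustration. Python enforces hiding merely through name mangling, so the "hacker" route remains open, which is roughly the situation described above.

```python
class Counter:
    """Hypothetical object whose state is meant to be reached only through methods."""

    def __init__(self):
        # The double underscore makes Python store this as _Counter__count.
        self.__count = 0

    def increment(self):
        self.__count += 1

    def value(self):
        return self.__count


counter = Counter()
counter.increment()

# The straightforward way of reading the data from outside the class fails.
try:
    counter.__count
    direct_access_worked = True
except AttributeError:
    direct_access_worked = False

# The procedural interface is the intended path.
value_through_interface = counter.value()

# A determined caller can still reach the inner structure.
value_through_backdoor = counter._Counter__count
```

Languages such as C++ or Java reject the direct access at compile time instead, but the principle is the same: the data layout is no longer part of what callers are supposed to see.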
But in practice, whether encapsulation is used or not, altering a data structure that is exchanged, in any way, with the outside world, remains a problematic programming practice. I will now demonstrate this.
Data as part of an application’s interface
If one attempts to take data out of an application’s interface, it unfortunately raises a number of questions:
- What happens when data is exchanged at an application-library interface?
- What happens when data is exchanged between two processes communicating over a network, one of which may not be easily modifiable (third-party developers, idiosyncratic update policies…)?
- What happens when a program serializes data to a storage medium for use by a later version of itself?
As it turns out, the only data structure that can be safely changed in a computer program is that which is never exchanged with the outside world.
In the dark ages of computing, when programmers devised complex file formats that were essentially a serialized version of their inner data structures, this was not a problem. The problem, rather, was to find a way to keep file formats in sync with the inner data structures of the program whenever said program changed, and to remain compatible with old files in new versions of the software.
However, we now have moved beyond such effort duplication. Modern programming languages usually sport a generic way to serialize near-arbitrary data structures to disk, without any custom wrapper. Custom file formats are thus now saved for edge cases where, for some reason, this default serialization scheme is not powerful enough. Programmers will readily use such primitives whenever they need to save data to disk, instead of designing custom on-disk formats and wrappers for this common purpose.
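As an illustration, here is a small Python sketch of such generic serialization. The `Settings` class and the `save`/`load` helpers are hypothetical names, and I use JSON over the object's attribute dictionary as a stand-in for whatever primitive a given language offers. The point to notice is that the stored format is derived directly from the in-memory layout, with no format designed in between.

```python
import json


class Settings:
    """Hypothetical application object."""

    def __init__(self, theme, font_size):
        self.theme = theme
        self.font_size = font_size


def save(obj):
    # Generic: dumps whatever attributes the object happens to have.
    return json.dumps(vars(obj))


def load(cls, text):
    # Generic: rebuilds an instance from whatever attributes were stored.
    obj = cls.__new__(cls)
    obj.__dict__.update(json.loads(text))
    return obj


saved = save(Settings("dark", 12))
restored = load(Settings, saved)
```

Nothing in `save` or `load` knows about `theme` or `font_size`, so any change to the attributes of `Settings` silently becomes a change of file format.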
The flip side of how simple data serialization has become, however, is that it is increasingly hard to discriminate which data is going to be used solely inside of an application, and which is going to be exchanged with the outside world. It is just as hard, and this is what interests us here, to keep track of changes to an application’s data interface, and of the compatibility breakages that they cause.
Tracking data interface changes
The only way data structures that are part of an application’s interface can change safely is through the addition of optional members. Anything else is likely to break a form of application compatibility, either with third-party code or with legacy data.
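A minimal Python sketch of this rule follows, assuming JSON records and hypothetical parser functions standing for successive versions of an application. Adding an optional `nickname` member keeps old and new versions interoperable in both directions, while renaming an existing member does not.

```python
import json


def parse_profile_v1(text):
    # Original version: only knows about "name".
    raw = json.loads(text)
    return {"name": raw["name"]}


def parse_profile_v2(text):
    # Compatible change: "nickname" is optional and defaults to None.
    raw = json.loads(text)
    return {"name": raw["name"], "nickname": raw.get("nickname")}


def parse_profile_renamed(text):
    # Incompatible change: "name" was renamed to "full_name".
    raw = json.loads(text)
    return {"full_name": raw["full_name"]}


old_record = '{"name": "Ada"}'
new_record = '{"name": "Ada", "nickname": "countess"}'

new_reads_old = parse_profile_v2(old_record)
old_reads_new = parse_profile_v1(new_record)

try:
    parse_profile_renamed(old_record)
    rename_is_compatible = True
except KeyError:
    rename_is_compatible = False
```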
The safest and most general way to test this property, once the application’s data interface is finalized, is to build a set of automated tests that are kept separate from the core application codebase, with their own definition of the application’s data structure, and then try to exchange data with the application by whatever means are available, so as to check if data input and output is handled properly.
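Such a test could look like the following Python sketch. The `encode_point` and `decode_point` functions are a hypothetical application interface, and the byte layout (two little-endian 32-bit integers) is an assumption made for the example. What matters is that the test carries its own frozen description of the data and does not import it from the application.

```python
import struct


# Application side (hypothetical): writes a point as two little-endian int32.
def encode_point(x, y):
    return struct.pack("<ii", x, y)


def decode_point(data):
    return struct.unpack("<ii", data)


# Test side, kept apart from the application code: the expected bytes are
# written down once, when the data interface is finalized, and never derived
# from the application's own definitions.
GOLDEN_POINT = bytes.fromhex("0100000002000000")  # x = 1, y = 2


def check_data_compatibility():
    writes_expected_bytes = encode_point(1, 2) == GOLDEN_POINT
    reads_expected_values = decode_point(GOLDEN_POINT) == (1, 2)
    return writes_expected_bytes and reads_expected_values


compatible = check_data_compatibility()
```

If a later version of the application changes the field order or width, this check fails even though every function keeps the same signature.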
Unfortunately, data encapsulation makes this task particularly difficult. By making it extremely difficult to construct arbitrary application input and output, and by forcing the use of objects’ procedural interfaces, it requires a tester who would like to assert data compatibility to use a full copy of the object’s definition, including its code, and to access its data solely in the ways that the object’s designers intended, which makes it harder to expose data compatibility bugs.
Conversely, using objects the way they were designed to be used, by accessing them solely through the procedural interface of the program being tested and not caring about their true inner structure, means that it is possible to introduce a software change that breaks the software’s data compatibility, but does not break its automated tests, since the semantics of the tested program remain identical.
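Here is a small Python sketch of that failure mode. The two `Temperature` classes stand for two releases of the same hypothetical object, and `save`/`load` are the same kind of generic serialization helpers that languages provide. Both releases pass a test written against the procedural interface, yet data saved by the first cannot be used by the second.

```python
import json


class TemperatureV1:
    """First release: stores degrees Celsius directly."""

    def __init__(self, celsius):
        self._celsius = celsius

    def celsius(self):
        return self._celsius


class TemperatureV2:
    """Second release: same interface, different inner representation."""

    def __init__(self, celsius):
        self._millicelsius = int(celsius * 1000)

    def celsius(self):
        return self._millicelsius / 1000


def save(obj):
    return json.dumps(vars(obj))


def load(cls, text):
    obj = cls.__new__(cls)
    obj.__dict__.update(json.loads(text))
    return obj


# A test that only uses the procedural interface sees no difference.
interface_test_passes = TemperatureV1(21).celsius() == TemperatureV2(21).celsius()

# Data written by the first release is unusable in the second.
stored_by_v1 = save(TemperatureV1(21))
try:
    load(TemperatureV2, stored_by_v1).celsius()
    old_data_still_works = True
except AttributeError:
    old_data_still_works = False
```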
This flaw is, as far as I can tell, intrinsic to the very concept of data encapsulation. It’s the way it’s been designed to work. And it makes testing encapsulated object-oriented code for backwards compatibility, which is just a subset of software compatibility considerations, a royal pain already.
Headers as a basic compatibility test
The aforementioned flaw may be mitigated when programming languages allow code to have a programmatically defined public interface that includes data structure and function definitions, especially if this interface may be checked for consistency whenever two pieces of code are dynamically linked with one another.
This public interface file should be easily generated from the original source code, copy-pasted if possible. In languages which have source file headers/specifications and implement link-time consistency checks, like Ada, that functionality may be used to this end.
Unfortunately, many popular object-oriented programming languages, such as Java, C# and Python, have all but dropped support for source file headers, arguing that they impose unnecessary duplication between code and its header file. These languages’ designers failed to realize the compatibility-preserving power of source-header consistency checks, most likely because they never tried a programming language which had them.
Linker consistency checks can, without requiring much programmer effort, catch a whole lot of simple software compatibility breakages, such as the addition or removal of structure fields and function parameters. Obviously, more subtle breakages related to the exact semantics of functions and data will still require dedicated custom testing protocols to be noticed.
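In a language without headers, a rough analogue of this check can be written by hand. The Python sketch below is only an approximation of what a linker would do, under the assumption that the published interface is recorded as a plain dictionary of field names and type names. `PUBLISHED_HEADER`, `Point`, `PointChanged` and `interface_diff` are all hypothetical names.

```python
# Frozen description of the public data interface, playing the role of a header.
PUBLISHED_HEADER = {"x": "int", "y": "int"}


class Point:
    """Current implementation, still matching the header."""
    x: int
    y: int


class PointChanged:
    """A later implementation that gained a field."""
    x: int
    y: int
    z: int


def describe(cls):
    # Reads the declared fields and their type names from the class itself.
    return {name: kind.__name__ for name, kind in cls.__annotations__.items()}


def interface_diff(header, cls):
    current = describe(cls)
    added = sorted(set(current) - set(header))
    removed = sorted(set(header) - set(current))
    retyped = sorted(
        name for name in set(header) & set(current) if header[name] != current[name]
    )
    return added, removed, retyped


unchanged = interface_diff(PUBLISHED_HEADER, Point)
changed = interface_diff(PUBLISHED_HEADER, PointChanged)
```

Like a linker check, this only notices structural changes. It says nothing about a field that keeps its name and type but changes meaning.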
Dynamic languages: the ultimate compatibility nightmare?
So far, I have assumed that program data is well specified in a program’s source code: that the structure of data objects is well-known, and that the type of every variable that is being manipulated is well-known too, or can at least be traced back to a well-known data type through inheritance and inclusion hierarchies. These properties hold in statically typed languages such as C++, Pascal or Ada. Dynamically typed languages, however, offer no such guarantees.
In dynamically typed languages, one may easily modify data structures at run time, add or remove members to them, change the type of variables, and even call member functions of encapsulated objects without a clear knowledge of what kind of object is being manipulated, or whether it even has this member function.
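The following Python sketch shows each of these operations on a hypothetical `Record` object. None of them is rejected before the program runs, and the missing member function is only discovered at the moment of the call.

```python
class Record:
    """Hypothetical object with a single declared member."""

    def __init__(self):
        self.size = 10


record = Record()

record.color = "red"  # add a member at run time
del record.size       # remove a member at run time
record.color = 3      # change the type of a member


def summarize(obj):
    # Nothing states what kind of object this is, or whether it has summary().
    return obj.summary()


try:
    summarize(record)
    call_failed = False
except AttributeError:
    call_failed = True
```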
This feature basically completes the divorce between program code and data that was initiated by the object-oriented world’s push for encapsulation. It does so on the premise that in many situations, having precise knowledge of the kind of data that a program manipulates is unnecessary, and may even be harmful to code reusability.
However, this reasoning is, as discussed before, only true for self-contained and stateless programs that do not exchange structured data with the outside world.
Otherwise, such a mentality ends up being quite harmful to the preservation of software compatibility, even more so than the careless use of encapsulated objects. And all that dynamic languages have to offer to address this problem is the widespread ability to store (byte)code alongside data. This is a practice whose security implications I probably need not expand upon in order to conclude that it should be reserved for a handful of specific, well-controlled scenarios.
In their perpetual attempt to make programmers’ lives easier, programming languages have increasingly been moving in a direction where the data that programs manipulate is hidden, unspecified, and may change at any moment.
It seems to me that the moves that they make to this end, such as encouraging the encapsulation of data or the universal use of dynamic data types, make it very difficult to preserve software compatibility as software evolves. That is because whether software authors want it or not, structured data is nearly always part of a program’s interface.
By encouraging the emergence of a world where programmers know nothing about the data they manipulate and do not want to care about it, data encapsulation and dynamic typing are, in my opinion, harmful to compatibility. They make it harder for programmers to notice when they are breaking their software’s interface, and to build automated tests to ensure that this only happens in a controlled fashion.
Do you think that I am overreacting to this? Exaggerating the risks? Or do you agree that excessive data encapsulation may lead to the emergence of a world where software data will ultimately become as transient and perishable as the media it is transiently stored upon? Where software data will be like the VHS, audio tapes, and CompactFlash cards of my youth, only with a much shorter lifetime and with additional dynamic linking and network communication issues to make it worse…
If you agree, how do you think one could devise programming languages, practices, and cultures that enforce a more rigorous preservation of software compatibility, without at the same time neglecting developer comfort?