Language-agnostic abstractions considered harmful

To conclude this first series on programming languages for OS development, I’d like to discuss the closely related issue of communication between programs written in different programming languages. This is an important issue for general-purpose operating systems, because they host, and thus communicate with, programs written in a wide range of programming languages. Here, I’ll discuss typical approaches to this problem, and how I think TOSP should go about it.

What happens when two programming languages communicate?

A hard abstraction mapping problem is involved

Operating systems which are mainly written in C, or a close derivative such as C++, tend to communicate with application programs only through a C-based ABI, complete with its limited abstraction capabilities, null-terminated strings, and weak type safety. Faced with this observation, the first reaction of any self-respecting OS nerd will be to rant about what reeks of sloppy design and coding practices.

However, consider, as an example, how you’d go about communication between programs written in Assembly and C#.

The former is as low-level as programming languages get. It has no portable dialect across hardware architectures, cannot even handle strings natively on any modern piece of hardware, and requires careful thought when implementing something as basic as a subroutine. The latter is as crowded with high-level abstractions as programming languages have yet gotten.

No matter how you go about it, there is no clear mapping between the abstractions of the two languages. You can’t express the “lock cmpxchg8b” of x86 Assembly in C# any better than you can cleanly handle a C# class with overloaded array addressing operators in Assembly code. Even the data types are not comparable. The two languages are miles apart in terms of abstraction, because they are built to serve entirely different purposes. Assembly is about precise control over machine operation, whereas C# is about quick and (hopefully) clean coding at the application level.

Though this example was deliberately picked to be extreme, it does illustrate the point that communication between different programming languages is an intrinsically hard problem. Any time programs written in two languages of distant lineage must communicate with one another, someone has to sit down and think hard about what kind of common abstraction they might use to this end.

After all, thanks to the magic of nonstandard ABIs, even two programs written in the same language can sometimes have a hard time communicating with one another.

Performance fundamentally cannot be very good

Programming languages also have many ways of handling data. Implementations of boolean data types, as an example, are extremely varied, and for a single programming language they can change from one architecture to another, and even from one level of compiler optimization to another. String handling is another example of a task for which no single, standard way of doing things has been established. Ada alone exhibits three different ways of handling strings within a single programming language!

This means that when programs written in one programming language want to communicate with programs written in another, data will have to be converted from a representation suited to the source language into one suited to the destination language. This operation can be extremely expensive, depending on how different the internal data representations used by the two languages (or their implementations) are.

As an example, let’s imagine that a given implementation of a programming language A has support for relative memory pointers. In that implementation, pointers targeting “the memory block +48 bytes away from this pointer” are legal. This means that two programs written using A can pass linked lists to one another without any serialization process, as long as it is ensured that all linked list items are kept in a contiguous block of virtual address space, using the same ordering.

Now, let’s try to have these programs communicate with another program, this time written in a programming language/implementation combination B which only supports absolute memory pointers. In this language, pointers can only target a specific address in a running program’s address space. With a little bit of thought, one quickly realizes that transmitting linked lists to this program will involve a performance-intensive bit of pointer rewriting, in which every pointer of the linked list has to be recomputed to target the proper position in the destination program’s address space.
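To make this concrete, here is a minimal C sketch of the kind of rewriting involved. The node layouts and names are invented for illustration; they are not taken from any real language implementation, and the hypothetical language A’s relative links are modeled as self-relative byte offsets:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical "language A" node: the link is a byte offset from the start
 * of this node to the start of the next one (0 marks the end of the list).
 * A list laid out contiguously can be copied or shared as one block with
 * no fix-up at all. */
struct rel_node {
    int32_t value;
    int32_t next_offset;
};

/* Hypothetical "language B" node: the link is an absolute pointer, which is
 * only meaningful inside one specific address space. */
struct abs_node {
    int32_t value;
    struct abs_node *next;
};

/* Crossing the boundary from A to B means walking the list and recomputing
 * every single link: this is the pointer rewriting cost discussed above. */
static size_t rewrite_list(const struct rel_node *head,
                           struct abs_node *dst, size_t max)
{
    size_t i = 0;
    for (const struct rel_node *cur = head; cur != NULL && i < max; ++i) {
        dst[i].value = cur->value;
        dst[i].next = NULL;
        if (i > 0)
            dst[i - 1].next = &dst[i];
        cur = cur->next_offset
            ? (const struct rel_node *)((const char *)cur + cur->next_offset)
            : NULL;
    }
    return i;
}

int main(void)
{
    /* Three contiguous nodes; each link says "the next node starts
     * sizeof(struct rel_node) bytes further on", except the last. */
    struct rel_node src[3] = {
        {1, (int32_t)sizeof(struct rel_node)},
        {2, (int32_t)sizeof(struct rel_node)},
        {3, 0},
    };
    struct abs_node dst[3];

    rewrite_list(src, dst, 3);
    for (const struct abs_node *n = dst; n != NULL; n = n->next)
        printf("%d\n", n->value);
    return 0;
}
```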

Another example of expensive data conversion occurs when two programming languages store Unicode strings using different internal normalizations. When large strings are passed around between programs written in these two languages, the communication framework has to parse them and rewrite them using more or fewer code units, which involves a lot of comparing and moving data around, and possibly some memory reallocation too.
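As a small, self-contained illustration of why this is costly (the normalization forms are real Unicode, but the scenario is made up), the very same character does not even have the same length under NFC and NFD, so conversion can never be a simple copy:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The character "é" encoded as UTF-8 under two Unicode normalization forms:
     * NFC: the precomposed code point U+00E9                  -> 2 bytes
     * NFD: U+0065 ("e") followed by U+0301 (combining acute)  -> 3 bytes */
    const char nfc[] = "\xC3\xA9";
    const char nfd[] = "\x65\xCC\x81";

    printf("NFC: %zu bytes, NFD: %zu bytes, same bytes: %s\n",
           strlen(nfc), strlen(nfd), strcmp(nfc, nfd) == 0 ? "yes" : "no");

    /* A language boundary that disagrees on the normalization form has to
     * decode every code point, recompose or decompose it, and usually
     * reallocate the destination buffer, since the length changes. */
    return 0;
}
```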

From these examples, it should be clear that where performance matters, as is the case for many core operating system facilities, communication between programs written in multiple languages should be avoided. Note that I have reached this conclusion irrespective of the exact communication mechanism being used. But let’s now try to answer the question raised at the beginning of this article: why I would consider typical attempts at easing interaction between programs written in various languages even more harmful.

Trying to standardize cross-language communication

Lowest or largest common denominator?

As we have discussed before, programming languages have widely varying sets of abstractions. Some are object-oriented, others are designed for procedural programming. Some use null-terminated strings, others prefer bounded arrays. Some have native support for Unicode, others have only integrated it through lots of painful duct tape and ugly hacks. Some languages have unions, others have closures. Some give you a lot of control over what the machine is doing, others focus on the clean specification of programmer intent.

Making this zoo of abstractions interact in a standard way is intrinsically difficult, and a wide variety of approaches have been tried. But these can mostly be separated into two broad categories: those that attempt to define a lowest common denominator of programming language features, and those that attempt to define a largest common denominator of them.

The former is arbitrary by nature, because as soon as a sufficiently broad set of programming languages is considered, it becomes impossible to find common features. To get an idea of the difficulty of the task involved, consider, for a second, the fact that not all programming languages have standard support for integer manipulation. That’s how bad it is. So typically, users of this approach decide on a set of language features that every programming language “should” support in order to interact well with the operating system. And since programmers pick abstractions that are familiar to them, that’s how we end up with C APIs as the only standard way to interact with OSs. In effect, this approach ends up being as good as doing nothing to help cross-language communication.
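In practice, that lowest common denominator looks something like the following header. It is an invented example, not taken from any real OS, but the shape is familiar: opaque integer handles, null-terminated strings, raw buffers, and integer error codes, because that is roughly all that every language can be expected to bind to.

```c
/* Illustrative sketch of a "lowest common denominator" OS interface:
 * a flat C ABI that every language is expected to be able to call.
 * The names are hypothetical; only the shape matters. */

#include <stddef.h>
#include <stdint.h>

typedef int32_t  os_status;   /* integer error codes, no exceptions       */
typedef uint64_t os_handle;   /* opaque handle, no objects or inheritance */

/* Null-terminated strings, untyped buffers, explicit sizes: no bounded
 * strings, no generics, no closures, no rich user-defined types. */
os_status os_open (const char *path, uint32_t flags, os_handle *out_handle);
os_status os_read (os_handle h, void *buf, size_t len, size_t *out_read);
os_status os_write(os_handle h, const void *buf, size_t len, size_t *out_written);
os_status os_close(os_handle h);
```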

The latter approach, of designing an all-encompassing framework for communication between each and every programming language in existence, is less arbitrary in nature. Designers of such frameworks, like SOAP or CORBA, typically start with a feature set that they deem “good enough”, and then prove willing to add more expressiveness to their language-agnostic communication framework given a good enough rationale. Sometimes, they are very thorough at supporting, from the get-go, an extremely wide set of programming language abstractions. So at first sight, this kind of approach to cross-language communication might sound reasonable. But we will now review why it is questionable in general, and downright unsuitable for OS-application communication.

Dealing with incompatible standards by creating another one

There is some xkcd wisdom to be invoked whenever one tries to fix the problem of interoperability between different approaches to the same problem by proposing yet another approach to that problem. Even more so when the proposed solution takes the form of a “universal adapter”.

In the case of communication between programming languages using a largest common denominator approach, it is, in addition, worthwhile to sit for a second and reflect on what the engineers behind such systems are trying to do. Irrespective of the specific way they are going about it, they are effectively trying to map nearly every programming abstraction in existence to nearly every other programming abstraction in existence. In other words, they are trying to create a universal programming language.

Industry experience with C++, which no developer can claim to fully understand, and which still struggles with issues as basic as standardizing its ABI, should remind us that this is not a very good idea. And that’s before the extra design goals of such systems even come into play.

On text as a medium for data exchange

See, when one is megalomaniacal enough to try to design a universal programming abstraction, one needs huge amounts of manpower. To justify the expense, goals as simple as “making different programming languages communicate” are not enough, since problem-specific, ad-hoc approaches turn out to be simpler, more efficient, and cheaper to implement. So one needs to think of an even greater purpose. Typically, said greater purpose is that it should be possible not only to exchange data between programs locally, on a given machine, but also to send it, over an unreliable network, to another computer possibly running a totally different hardware architecture and OS.

This is typically the point where clever people start converting everything to XML and JSON, and pragmatic people realize that all sanity has been lost.

ASCII text-based data serialization is great for data interchange that occurs very infrequently, in very ill-defined systems, and can’t be done otherwise. For everything else, just considering the performance overhead of converting data to text and back makes one realize what a bad idea this is. If one more argument is needed, consider how universal data storage formats like XML are intrinsically hard to parse efficiently, and, due to over-engineering, also often involve sending many more bytes than necessary just to conform to a well-specified grammar.
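To put a rough number on that overhead, compare one small in-memory record with the same record spelled out as XML-ish text. The record and the markup are made up, and exact sizes will vary with the format, but the ratio is typical:

```c
#include <stdio.h>
#include <stdint.h>

/* A small record as it sits in memory: 16 bytes with a fixed layout,
 * which two C-compatible programs could exchange with a plain memcpy. */
struct sample {
    uint32_t sensor_id;
    uint32_t sequence;
    double   value;
};

int main(void)
{
    struct sample s = {42, 1337, 3.141592653589793};
    char text[256];

    /* The same record as XML-ish text: every field must be formatted on the
     * way out and parsed back on the way in, and the result is several
     * times larger than the in-memory representation. */
    int n = snprintf(text, sizeof text,
                     "<sample><sensorId>%u</sensorId>"
                     "<sequence>%u</sequence>"
                     "<value>%.17g</value></sample>",
                     (unsigned)s.sensor_id, (unsigned)s.sequence, s.value);

    printf("binary: %zu bytes, text: %d bytes\n%s\n", sizeof s, n, text);
    return 0;
}
```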

Why do people do this?

At this point, if you agree with my rant so far, you may start wondering why people go about the problem in such a convoluted way.

In many specific cases, it’s possible to solve cross-language communication problems without expending that much effort. Many programming languages offer facilities to convert some of their internal data representations to C’s, and that can be used as a last-resort lingua franca when no better communication mechanism is available. Then, for any specific (Language A, Language B) pair, one may easily think of a communication binding that is a lot more straightforward than any of the engineering atrocities committed in the name of generality.

The reason why this kind of approach is unpopular basically boils down to algorithmic complexity. If we denote by N the number of supported programming languages, creating language-specific communication bindings for every pair is O(N^2), whereas creating bindings to a standard inter-language communication framework is O(N). So as N grows, we expect the second approach to become a lot more efficient than the first.

Plus, it’s a lot more stimulating to work on a Universal Cross-Language Network-Transparent Communication Framework than it is to write one more binding between C and your favorite programming language of the day.

What I’d argue, though, is that for small values of N (there aren’t that many popular programming languages), O(N^2) may easily win over O(N), especially when the prefactor in front of the N is very high, and when the looser problem definition of the universal approach comes with serious caveats in such critical areas as communication efficiency.
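That tradeoff can be written down explicitly. Assume, purely for the sake of argument, a cost c_pair per direct binding and a (much larger) cost c_univ per language adapter for the universal framework; these symbols are my own back-of-the-envelope framing, not figures from any study:

```latex
C_{\text{pair}}(N) = c_{\text{pair}}\,\frac{N(N-1)}{2}, \qquad
C_{\text{univ}}(N) = c_{\text{univ}}\,N, \qquad
C_{\text{pair}}(N) < C_{\text{univ}}(N)
\iff
N < 1 + \frac{2\,c_{\text{univ}}}{c_{\text{pair}}}.
```

So if, say, a universal adapter costs ten times as much per language as a direct binding (a made-up ratio), direct bindings remain the cheaper option for any N up to 20.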

And I’d also wager that, given how hard it is for a new programming language to gain traction, and how easily programming languages fade away after a while (who learns Basic these days?), we’ll likely never be able to claim that we’re in the large-N case.

Conclusion

As of 2014, the dream of a universal programming language fitting all use cases is dead. People have learned to accept that different programming languages fit different purposes more or less well. Some programs are best written in Assembly, others are best written in Python.

However, a tenacious remnant of the myth of the universal programming language is the idea that someday, all programming languages should learn to play well with one another. That operating systems should be able to interact equally well with programs written in any programming language. That one day, we won’t need to explicitly support programming languages in computer systems, and will be able to write programs in one language that interact equally well with programs written in any other, without any additional effort.

I hope I have shown convincingly here that in the specific context of OS-level development, where performance is a critical concern and low-level facilities must be kept simple for security reasons, this is about as desirable and likely as a worldwide mandate on the universal use of Esperanto.
