The problem with C (and its derivatives) as an OSdeving language

During the past year, two very enlightening programming-related things I did were to read Steve McConnell's Code Complete and to keep maintaining a mid-sized software system written in the fairly primitive Igor Pro development environment. This taught me a lot about programming in a language vs programming into a language, and about the limits of the latter approach to clean coding. Today, I'd like to discuss why, in spite of being theoretically suitable for coding anything, C feels to me like a terrible fit for OS development in our day and age.

First, a word of warning though. My goal here is not to bash the impressive work of Dennis Ritchie. Designing a programming language is hard, fiendishly so, and back in the day, he had to deal with hardware resources that are unimaginably small by current computing standards. Back in the day, "must compile quickly" was a difficult requirement, and "compiler must be easily portable" heavily limited the scope of what the language could do. By the standards of its time, C did pretty well, as evidenced by the success of the UNIX operating system which it was designed to implement.

What I’m saying is just that since C became popular, programming language designers have been hard at work on fixing its flaws and coming up with more sensible designs, and that nowadays, better alternatives exist. Their toolchain may be less mature, they may have less support in the wild, but their design itself is a significant improvement and I argue that for a new OS project, this is the most important thing, since every implementation issue can be fixed with a bit of willpower whereas design mistakes are (nearly) forever.

Unless otherwise noted, I’ll focus on C89, since it’s the latest standard that most compilers are guaranteed to support as of 2014. And so, let’s begin, first with the single most fundamental thing which a programming language is designed to manipulate…

The C way to data handling

Literals

The content of C literals is not preserved rigorously until the moment where the literal's contents actually need to be evaluated, unlike in other languages such as Haskell or Ada. Rather, a C compiler is expected to convert a literal to some machine representation of limited precision before doing anything with it. The user can manually specify the precision of the representation with cryptic suffixes ("f", "u"…), and if he doesn't, the compiler will just pick something that seems "good enough".

This behavior causes a wide range of subtle bugs. As an example, the following code will take the else branch, because it compares the approximations of 0.1 in single and double (or long double, on some architecture/compiler combinations) precision. Manually casting the literal to float, or defining it as a float by appending an "f" to it, fixes the issue.

#include <stdio.h>

int main() {
    float f = 0.1;

    if(f == 0.1) {
        printf("Well, comparing floating-point numbers works.\n");
    } else {
        printf("Ow... maths are complicated ain't they?\n");
    }

    return 0;
}

There are also limitations to what can be put in a literal in C. Struct and array literals, in addition to being of limited expressivity (How do you know which part of a struct you are assigning stuff to? How do you ensure that assignment statements keep the same semantics as structs are modified?), are forbidden anywhere but in variable initialization, making it dangerously tempting to use throwaway variables in their place, which go on to pollute the namespace, get reused for totally different purposes, become accidentally modified…
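To make this concrete, here's a minimal sketch (the point struct and the draw_pixel routine are made up for the occasion). In C89, an aggregate literal is only legal as an initializer, so passing a freshly built struct to a function forces exactly the kind of throwaway variable I just complained about:

#include <stdio.h>

struct point { int x; int y; };

/* Hypothetical drawing routine, stubbed out for the example. */
static void draw_pixel(struct point p) {
    printf("pixel at (%d, %d)\n", p.x, p.y);
}

int main() {
    struct point origin = { 0, 0 }; /* legal: the literal initializes a variable */
    struct point tmp;               /* the C89 workaround: a throwaway variable  */

    draw_pixel(origin);

    /* draw_pixel((struct point){ 4, 2 });  <- compound literals only arrived in C99 */
    tmp.x = 4;
    tmp.y = 2;
    draw_pixel(tmp);

    return 0;
}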

Boolean

Before C99, which is still ill-supported by a number of C compilers, C did not have a proper boolean data type. This means that in real-world C code, this critical feature is emulated either through the use of handmade project-specific booleans (#define PROJECT_TRUE 1), or worse, by littering the code with integer literals that are supposed to evaluate to the right boolean value, but are much more difficult to understand or search for in plain text.

Not to mention that when integer variables are used to store booleans, many possible compiler optimizations based on packing multiple booleans in a single bitfield are lost. And that integers and pointers can consequently be used as conditions, completely breaking any semblance of type safety which C could have even weakly pretended to have had.
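Here's a sketch of what this looks like in practice (all the names are made up, but the pattern should look familiar):

/* One project's home-made boolean -- every project reinvents its own. */
#define PROJECT_TRUE  1
#define PROJECT_FALSE 0
typedef int bool_t;            /* hypothetical project-specific alias  */

bool_t device_ready = PROJECT_TRUE;
char  *error_message = 0;      /* a null pointer doubles as "no error" */

int status_check() {
    if (device_ready)          /* integer silently used as a condition */
        return PROJECT_TRUE;
    if (error_message)         /* pointer silently used as a condition */
        return PROJECT_FALSE;
    return 17;                 /* also "true", as far as C cares       */
}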

Integers

What value is there to having multiple integer types of different size with unclear names, such as short, long, long long, and int, which can be a synonym of either of the previous ones? Little, if the semantics of all of these types depend not only on your specific compiler implementation, but also on the architecture you're compiling for. How do you pick the right integer size for your application, then? How do you know which one is large enough?

Until the release of C99, which again is still not fully supported in the real world (and not in common use), there was no standard answer to this question. Picking the right integer for a given application involved a mixture of implementation-dependent sizeof() probing, (flawed) performance tests, and guru questioning. And usually, upon a compiler change, it would break anyway.
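A small sketch of both eras, the C89-style guesswork and the C99 fix, where your compiler supports it:

#include <stdio.h>
#include <limits.h>

int main() {
    /* C89: probe and hope. The answers change with every compiler and target. */
    printf("short: %u bytes, int: %u bytes, long: %u bytes\n",
           (unsigned) sizeof(short), (unsigned) sizeof(int),
           (unsigned) sizeof(long));
    printf("INT_MAX here is %d, but how large is it on the next target?\n", INT_MAX);

    /* C99, where supported, finally says what it means:
     *   #include <stdint.h>
     *   uint32_t counter;   -- exactly 32 bits on every conforming platform
     */
    return 0;
}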

As of today, C still has no standard support for 128-bit, 256-bit and 512-bit integers, even though modern-day processors can perform a number of operations on them in an optimized way, and compilers have provided nonstandard intrinsics simplifying support for them for a long while. Nor does it have support for integer sizes which aren't powers of two, such as 24-bit, which is commonly used in graphics and audio processing, and consequently has plenty of hardware support on a number of architectures.

It also offers no handling of integer overflow and underflow, or other kinds of boundary checking, in spite of many hardware architectures now offering native support for this very important defensive programming feature.

Finally, for a low-level programming language, C also has remarkably limited support for bitwise integer processing. It does have bitwise boolean operations, with a syntax that is dangerously similar to that of regular boolean operations, but it doesn’t have an easy and standard way to handle multiple integer values stored inside of a single machine word. Bitfields, which were designed to solve exactly this problem, were (un)standardized in such a way that they are useless to any C program which wants to achieve compatibility across multiple compilers and architectures.
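To illustrate, here's a sketch with a hypothetical device status register. Everything that matters about its bitfield layout is left to the implementation:

#include <stdio.h>

/* A hypothetical device status register, described with bitfields. */
struct status_register {
    unsigned int ready   : 1;
    unsigned int error   : 3;
    unsigned int channel : 4;
};

int main() {
    /* Whether 'ready' is the lowest or the highest bit, how the fields
     * are packed, and even the total size of the struct are all left to
     * the implementation, so this struct cannot portably be overlaid on
     * real hardware bits.                                              */
    printf("sizeof(struct status_register) = %u\n",
           (unsigned) sizeof(struct status_register));
    return 0;
}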

Real numbers

I have previously hinted at the way C's literal handling can make floating-point computation more difficult than it should be. But it's also worth noting that it is only the great success of IEEE 754 as a standard for representing floating-point numbers that allows C to have relatively consistent floating-point semantics across hardware. On some legacy architectures predating IEEE 754 hardware support, float semantics in C may not reflect what you would expect when it comes to rounding, special values (NaN, +/- inf) and system exceptions.

For a low-level language, C actually gives very little access to the true richness of the machine’s floating point handling capabilities. All the gory details of floating point handling, from picking between the many kinds of rounding available, to awareness of the mantissa and exponent bounds and their impact on computational precision, are provided by the standard library in some cases, and in many cases not provided at all (though C99 improved upon this a bit).

C also has no native support for fixed-point numbers, even though these come in handy on embedded systems, when there is a need for extra speed on certain operations, when predictability is preferred over convenience, or when the idiotic endless decimals of approximate floating-point numbers can become a legal liability (think about financial transaction amounts). Due to the way C handles real literals, fixed-point support would be very hard to implement efficiently.

Arrays

In order to accommodate the limited computer resources of its time, C famously doesn't store array sizes alongside arrays themselves. This means that whenever arrays are considered, length information must either be carried alongside arrays in separate variables or nonstandard structures, or encoded inside of the array itself as a special value, the way null-terminated strings work.

The former makes array manipulation more complicated, while the latter is a horrible idea for both performance and security reasons, which I will discuss in more detail in the part of this article about string handling.
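The classic demonstration of the former problem (a minimal sketch): the same array answers sizeof() differently depending on where you ask, because it silently decays into a bare pointer the moment it is passed to a function:

#include <stdio.h>

static void process(int data[]) {  /* 'data' is really just an int*        */
    printf("inside:  %u bytes\n", (unsigned) sizeof(data));  /* a pointer  */
}

int main() {
    int samples[16];
    printf("outside: %u bytes\n", (unsigned) sizeof(samples)); /* 16 * sizeof(int) */
    process(samples);  /* the length never made the trip */
    return 0;
}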

C does have support for multidimensional arrays, which it silently serializes into one dimension. However, the lack of run-time array length storage greatly limits their usefulness, by requiring the size of these arrays to be hardcoded in the source code. Without C99's variable-length arrays, it is impossible to build a true multidimensional array at run time: you must allocate a flat block yourself and do the index arithmetic by hand.

In a textbook example of abstraction leakage, C array indexing also must always be based on integers, and start at 0. If you want, say, an array indexed by an enum whose values are not contiguous or do not start at zero, then you'll need to do some horrible things to map that enum onto integers that match these properties, and pick the right size when allocating your array.
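For instance (a sketch with hypothetical hardware status codes, which are sparse, as such codes often are):

/* Hypothetical hardware status codes -- sparse, as such codes often are. */
enum hw_status { HW_OK = 0x00, HW_BUSY = 0x10, HW_FAULT = 0x80 };

/* An array "indexed by" this enum must really be indexed by dense
 * integers starting at 0, so a manual translation layer is needed. */
static int status_to_index(enum hw_status s) {
    switch (s) {
        case HW_OK:    return 0;
        case HW_BUSY:  return 1;
        case HW_FAULT: return 2;
        default:       return -1;  /* and now error handling, too */
    }
}

static const char *status_names[3] = { "ok", "busy", "fault" };
/* Usage: status_names[status_to_index(HW_BUSY)] == "busy" */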

Characters and strings

Like many programming languages of its time, C was initially unable to reliably process text written in any language but English, and optionally a couple more European languages through platform- and compiler-specific hacks. Even in C99, support for non-ASCII characters in source files still is “implementation-dependent”, effectively making it fiendish to use string literals written in languages other than English unless you have the native ability to read soups of \x’s and \u’s.

In today's international, multilingual world, where Unicode use has become mandatory rather than optional, dealing with C to process any kind of textual content is an elaborate form of masochism. It is possible to do it right, but the language and existing code libraries will do all they can to lay your efforts to waste, most frequently by confusing character and byte semantics. To the C designers' credit, the standard library actually now gets it right most of the time.

Beyond this perfectible handling of isolated characters, C strings also feature one of the worst ideas ever introduced in the area of text processing: null termination. This "feature", which was probably designed so as to overcome the 255-character limit of Pascal strings without consuming more memory, makes it difficult to write reasonably efficient AND straightforward text manipulation code, since it requires strings to be parsed with lots of inefficient byte-wise operations anytime their length needs to be known for some reason.

Null termination also enables a wide range of buffer overflow vulnerabilities that continue to plague nearly all C software to this day, and can prove particularly difficult to patch in concurrent settings where string length can change over time.
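Here's a sketch of the performance half of the problem (the helper function is made up, but the pattern is ubiquitous). Since the only length record is the terminating '\0', every strcat() call re-scans everything written so far, making this loop quadratic in the final string length:

#include <string.h>

/* Hypothetical helper: joins n words into a caller-provided buffer. */
void join_words(char *buffer, const char **words, int n) {
    int i;
    buffer[0] = '\0';
    for (i = 0; i < n; i++) {
        strcat(buffer, words[i]);  /* walks the whole string so far... */
        strcat(buffer, " ");       /* ...and then walks it again       */
    }
    /* And nothing here checks that 'buffer' is actually big enough:
     * one word too many is a silent buffer overflow, not an error.   */
}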

Another area where C text processing shows its baroque side is its lack of native language facilities for simple string manipulations like concatenation, instead requiring programmers to use standard library functions with cryptic names and obscure syntax for this purpose. One can speculate that regular expression syntax might not have turned out so cryptic, if Perl developers had not been exposed to high doses of ionizing format string radiation during their childhood.

Type safety

But at least, unlike say Python, C has proper data types that can't be randomly assigned to each other or compared without triggering the holy anger of the compiler, right? Well, let's just say that C is about as useful for enforcing type safety as a sieve is for keeping water inside it: it can probably hold some around for a short while, but everything will be gone very soon.

C has this annoying habit of silently casting anything into anything else in an attempt to be helpful. Turning ints into floats (but int quotients into incorrect float quotients), floats into doubles, doubles into floats (DATA LOSS!), floats into ints (DATA LOSS!), ints into boolean expressions, boolean expressions into ints, pointers into boolean expressions…

So, if you still managed to keep a piece of data safe from C's awful data handling habits so far, do make sure that an unfortunate conversion does not inadvertently destroy it (as in double => int) upon a function call, and that you're doing what you intend to with it. With typedef being the joke that it is, nothing prevents you from comparing apples to oranges… or mixing up metric and imperial units, the way the Mars Climate Orbiter engineers inadvertently did.
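A sketch of a few of these silent conversions at work (the meters/inches typedefs are, of course, made up to echo the point above):

#include <stdio.h>

typedef double meters;  /* typedef is only an alias... */
typedef double inches;  /* ...not a new type           */

int main() {
    float  ratio     = 1 / 3;   /* int division happens first: ratio == 0.0 */
    int    truncated = 9.99;    /* silent double -> int: truncated == 9     */
    meters m = 2.0;
    inches i = 2.0;

    if (m == i)                 /* compiles without so much as a warning    */
        printf("2 meters == 2 inches, says the type system\n");

    printf("ratio = %f, truncated = %d\n", ratio, truncated);
    return 0;
}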

Even statements are trying to kill you

Assignment as an expression

Premature optimization is the root of all evil, a wise man once said. Perhaps this thought should have crossed the mind of whoever suggested that C assignments should be expressions, rather than statements. This feature had marginal use cases even back when optimizing compilers did not exist, and has been effectively obsolete ever since they arrived, except for encouraging people who commit readability sins in the name of conciseness (or just for fun).

However, it most certainly has created a very wide class of typical programmer errors, caused by the legality of statements such as "if(number = 1) do_something();" and "bool equality_check = (a = b);", which almost never match the intent of the people accidentally writing them.
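A minimal sketch of the classic form of the bug:

#include <stdio.h>

int main() {
    int number = 0;

    if (number = 1)    /* meant ==, typed =: assigns, then tests "1" */
        printf("this always runs, and number is now 1\n");

    if (number == 2)   /* what was actually meant                    */
        printf("this never runs\n");

    return 0;
}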

It also created the semantic distinction between variable++ and ++variable, which although of little practical consequence for C development is still misunderstood by many users of C-inspired languages, and can cause serious performance problems in object-oriented languages allowing for operator overloading such as C++.

Empty statements that are just a keystroke away

First, don’t get me wrong: NOP is a very important instruction in any programming language. The ability to have the processor do (nearly) nothing in specific circumstances is critical to many applications, including timing without timer interrupts, spin-locking, and emulating procrastination in computer software.

But a NOP should not be as easy to type as inadvertently leaving one extra semicolon somewhere. Because otherwise, you get this…

if(nothing_is_blocking_the_way());  /* stray semicolon: the if controls an empty statement... */
{
    run_train_forward_at_full_speed();  /* ...so this block runs unconditionally */
}

…and this…

while(humanity_behaves_nicely_enough());  /* empty body: the loop just spins while the condition holds... */
{
    save_the_justs();  /* ...and this block runs exactly once, after the loop finally exits */
}

initiate_rapture_of_doom();

Fallthrough switches

Hardware developers are fond of numerical status codes as a way to describe various conditions. Consequently, any self-respecting low-level programming language needs a clear way to take specific action depending on the value of a numerical parameter. In the C family, this is done through the switch() statement.

There are nearly as many ways of implementing switch() statements as there are programming languages, and they vary greatly in sophistication. C, however, due to its legacy nature, has one of the most primitive ones out there: a case, in a switch statement, can only cover one single value of an integer type, which must be specified as a compile-time constant.

This obviously would become a problem as soon as software needed to take the same action upon various values of a numerical status code. And thus, fallthrough switches were born:

switch(an_integer_value) {
  case 0:  /* no break here: execution falls through into case 1 */
  case 1:
    /* Do something for integer values 0 and 1 */
    break;
}

To account for this use case, switches in C were designed such that execution does not stop once a case label has been matched. In practice, this means that every switch statement in every C program, including those that do not need this fallthrough feature, needs to terminate every single one of its cases with a "break;" statement. Otherwise, disaster will ensue.

There’s something to be said about such designs which optimize for an uncommon use case by making life extremely difficult for people in the most common use case.
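To see how easily it goes wrong, here's a sketch of a hypothetical interrupt dispatcher. Forgetting a single break is not an error; it just silently runs the next handler too:

#include <stdio.h>

/* Hypothetical handlers, stubbed out for the example. */
static void handle_timer()           { printf("timer\n"); }
static void handle_keyboard()        { printf("keyboard\n"); }
static void handle_spurious(int irq) { printf("spurious %d\n", irq); }

static void dispatch(int irq) {
    switch (irq) {
        case 0:
            handle_timer();
            /* missing break: execution falls into the keyboard case! */
        case 1:
            handle_keyboard();
            break;
        default:
            handle_spurious(irq);
            break;
    }
}

int main() {
    dispatch(0);   /* prints "timer", then "keyboard" too -- oops */
    return 0;
}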

“Roll your own iterator” for loops

Computer programs do lots of repetitive things, especially traversing arrays and array-ish structures. And for loops have thus been a feature of almost all programming languages. In the design of C, Ritchie was visionary enough to realize that a for loop that can only count integers, as in Pascal, was probably not generic enough to account for the full breadth of repetitive jobs that computers can do, including traversing trees and linked lists. And thus, he came up with a design where a programmer can himself specify the initialization, continuation criterion, and iteration statement of a for loop.

What he failed to do, however, was to provide a generic for loop that makes the common case of traversing an array, or any other common container for that matter, actually easy. It may be argued that this is probably linked to C arrays’ aforementioned inability to keep track of their own length. Whatever the reason, however, this meant that programmers, not computers, ended up doing the very repetitive task of specifying for loop operation for any kind of common container traversal.

And, as humans are terrible at doing repetitive things late at night, people ended up doing pretty badly at it. Practical consequences of C's for loop syntax include plenty of off-by-one errors, causing mistakes in array initialization, writes to areas of memory that shouldn't have been accessed, and loops that don't terminate, causing program freezes.
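The canonical example (a minimal sketch): one "<=" where "<" was meant, and the last write lands one past the end of the array:

#define BUFFER_SIZE 8

int main() {
    int buffer[BUFFER_SIZE];
    int i;

    /* '<=' instead of '<': the loop body runs 9 times, and the last
     * write lands one past the end of the array -- undefined behavior
     * that may silently corrupt a neighboring variable.              */
    for (i = 0; i <= BUFFER_SIZE; i++)
        buffer[i] = 0;

    return 0;
}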

Death by a thousand cuts

I could go on and on about the many ways through which the C syntax is incredibly confusing to the reader, and how the fact that some people manage to understand it reliably is actually a significant human achievement.

I could discuss how “int a, b;” defines two integers, but “int* a, b;” defines one pointer and one integer.

I could invoke the syntax of function pointer declaration, and cover once again the dangerousness of using two nearly identical syntaxes for very different operations, as in “&” vs “&&” and “=” vs “==”.

Ultimately, though, I have to conclude at some point on this. And my conclusion is that the C syntax, although it was designed to be easy to process by compilers and probably did well at it, is not easy to process by human beings. And that's a problem, because code is read A LOT more often than it is written. And because this syntax makes it easy to commit big mistakes, and hard to subsequently notice them in code review and debugging.

That makes C an inadequate fit for any kind of software project where reliability is paramount, as is arguably the case with OS development. OSs must be reliable because when an OS goes down, every application on the machine goes down with it. And so, an OS must be considered at least as critical as the most critical application that may run on top of it.

But there is yet more to it.

Codebase organization… or lack thereof

cAse sensitivity

Want to step in a big can of worms? Start discussing the issue of case sensitivity with a guru of UNIX and related technologies. It is possibly as ridiculously sensitive an issue as the ability to access arbitrary memory addresses is to a programming language nerd ("But it breaks garbage collection!" "That, sir, is an issue with your technology, not my code"). But as a seasoned troll, I'll bite: case sensitivity, the way it has been implemented in the C programming language and most UNIX filesystems, is a serious usability flaw.

Here's why: if you allow for this kind of case sensitivity in a computer namespace, it means that there can exist two objects bearing names which are semantically identical to a human being (e.g. "i" and "I"), but distinct to a machine, inevitably causing confusion. It is one more way in which errors are just one keystroke away in C. And it also means that the machine will get overly pedantic about the exact casing of an identifier. Which, in some cases, mind you, is a quality. But most of the time, it is only an annoyance.

A well-designed case-sensitive namespace should forbid the creation of two identifiers which only differ in casing, and also provide avenues to notify users of casing mistakes and/or correct them automatically. However, since this is difficult to implement, the second best option is to forgo case sensitivity altogether, and go case-insensitive but case-preserving, the way Microsoft implementations of the NTFS filesystem did.

Headers are an afterthought

In large software projects, it is a good practice to separate the interface (also known as specification) of a code unit from its implementation. It allows software designers to hand developers a precise, code-based description of what they have to implement, and testers a precise description of what they have to test, without the two parties having to interact with one another. In the context of software libraries, it also makes it possible for a library to disclose its interface without disclosing its implementation, allowing said implementation to change in the future without any fear that developers will have based their software on it.

C headers satisfy this use case to some extent, but in a way that is deeply unclean and hacked together. They natively support neither multiple inclusion (a single header being included twice, e.g. through two other headers), nor circular inclusion (two header files referencing each other), requiring developers to abuse preprocessor directives in order to address these common use cases. They offer no namespace feature to account for the common situation of two unrelated headers declaring identically named functions, meaning that a library update is all it takes to break a build. And since it is not possible to make implementation-required header parts private, they still leak plenty of abstraction to the code including them.
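For reference, this is the include guard dance that every C header must perform just to survive being included twice (device.h and its contents are made up for the example):

/* device.h -- a hypothetical header. Without the #ifndef dance below,
 * being included twice (e.g. through two other headers) would redefine
 * struct device and break the build.                                   */
#ifndef DEVICE_H
#define DEVICE_H

struct device {
    int id;
};

void device_reset(struct device *dev);

#endif /* DEVICE_H */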

No default function parameter values

Many programming books advocate against the use of functions with many parameters, most of which have default values. What they fail to acknowledge, however, is that for a class of applications, this happens to be just the tool that one needs. Command-line programs, after all, are semantically identical to functions with lots of optional parameters, and have proved very successful at solving real-world problems. And mathematical packages also easily grow to become parameter-heavy.

Data structuring can be attempted to reduce the number of parameters which a function has. But it has drawbacks. It makes it more difficult to set defaults for some parameters. It introduces additional preprocessing steps before the function may be called. And when a structure is created just so as to replace multiple function parameters, and never reused elsewhere, it is also greatly annoying to developers.

So, having many function parameters, most of them optional, is not necessarily a bad thing. Except that C does not allow for it.

No named function parameters

Let us assume though, that you did, anyway, create a C function with lots of parameters, and have the patience to type each and every one of these parameters on every function call. How do you know that you are typing them in the right order?

This is actually a difficult task, because in C, there is no way to explicitly specify that a given expression should serve as a specific, explicitly named function parameter. You cannot, say, write “display_line_plot(trace=tr1, color=blue);”. Instead, you have to rely on library developers to use a parameter order that is reasonably logical, will never change in the future, and hope that they will never extend the functionality of the function you use with extra parameters because this would break your code.
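The closest thing C offers as a workaround is a parameter struct, sketched below with the hypothetical plotting function from above. It gets you named fields, at the price of all the ceremony from the previous section:

#include <stdio.h>

/* Hypothetical plotting API: a parameter struct stands in for the
 * named parameters C does not have.                                */
struct plot_options {
    int trace_id;
    int color;
    int line_width;
};

static void display_line_plot(struct plot_options opts) {
    printf("plotting trace %d in color %d, width %d\n",
           opts.trace_id, opts.color, opts.line_width);
}

int main() {
    struct plot_options opts;
    opts.trace_id   = 1;
    opts.color      = 4;  /* blue, say */
    opts.line_width = 2;
    display_line_plot(opts);  /* at least the fields are named... */
    return 0;
}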

No support for higher-level abstraction

For some use cases, functions are great. This is an important fact to remember in an age where it is customary to shoehorn objects into each and every programming problem there is. However, functions are only the simplest form of what is, in programming, called an abstraction. The abstraction provided by a function is: you put parameters in, you get a single result out, and you don't need to know what has happened in between. And in C, that's it.

More recent programming languages have introduced other kinds of abstractions that prove more useful for some problems. As an example, inheritance-based object-oriented programming has proved very useful in GUI toolkit development, where one is dealing with a hierarchy of reusable components sharing common low-level properties. Interfaces, which let one specify acceptable input through what it can do, rather than what it is, are also a really nice one. And contract-based programming, which allows one to specify in code what the preconditions, postconditions, and invariants of a program or subprogram are, is also an extremely powerful tool when developing self-documenting code.

Developing in C, however, remains much akin to running around with the proverbial function hammer, and treating every engineering problem that comes to your sight like the proverbial nail.

Limited error handling

The facilities provided by C for handling errors in program execution are extremely limited. The typical way to achieve error handling in this language is to abuse function return value semantics, e.g. by using NaN as a return value for float functions that have failed, or negative results for functions that would normally return positive integers. Never mind, of course, that it consequently becomes a lot more difficult to differentiate bogus results from valid error codes. And never mind either that this form of error handling involves polluting every function call with error checking on the result, if clean code is desired.

Alternatively, friends of slightly cleaner languages may choose to use errno instead, which is a global variable defined by the standard C library for the purpose of storing error codes. Never mind, of course, that this design is inherently thread-unsafe, to the point where the entire OS feature of thread-local storage was primarily designed so as to keep it working on modern multithreaded OSs. And never mind, either, the problematic side of using a single variable when potentially, several errors can occur in a row.
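Here's a sketch of both styles around a single standard call. Note how the caller must clear errno by hand, then guess whether the error lives in errno, in the return value, or in both:

#include <math.h>
#include <errno.h>
#include <stdio.h>

int main() {
    double r;

    errno = 0;            /* must be cleared by hand before the call   */
    r = sqrt(-1.0);       /* domain error: result and errno both dodgy */

    if (errno != 0)
        printf("sqrt failed, errno = %d\n", errno);
    else if (r != r)      /* NaN check: the error smuggled in-band     */
        printf("sqrt returned NaN\n");

    return 0;
}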

The real question is the following: when C code ends up with a nonzero errno variable, since errno has no standard semantics, how does it know what kind of error happened exactly, and where?

Let's face it: in most cases, exceptions are just an infinitely better error handling mechanism than invalid function results or global variables with ambiguous semantics. Except that C's alternative to exception handling, setjmp/longjmp, is pretty much as monstrous and tricky to get right as using GOTO in languages that have it. And it is obviously nowhere near as automated and feature-complete as the exception handling mechanism of a modern programming language.

Macros

As we have seen, C has plenty of quirks and limitations. In its early releases, it did not even have a standard, non-hackish way to define constants. So obviously, especially in large projects, people were bound to hit its limitations at some point.

Ritchie’s answer to this was C preprocessor macros, basically allowing developers to mess with the compiler’s internal source file processing, in an attempt to fix language deficiencies with code transformations.

The problem with preprocessor macros is that since they are expanded before expression evaluation, their semantics are extremely tricky to get right, even when one thinks that just enough parentheses will do the trick. For a well-known example of the contrary…

int a = 1, b = 2, c;
#define MAX(a, b) ((a) >= (b)) ? (a) : (b)
c = MAX(++a, b);  /* What is the value of a and c at the end of this line? */
                  /* Expansion: ((++a) >= (b)) ? (++a) : (b). The macro    */
                  /* argument is evaluated twice, so a == 3 and c == 3.    */

Standard library and related topics

Standard library limitations

Back when C was designed, it was not common for programming languages to have extensive standard libraries covering common programming issues. Programmers would normally rely on OS-specific APIs or custom code for that purpose. This means that the C89 standard library was, by modern standards, very limited. It did not appropriately address issues such as threading and concurrency, Unicode string handling, hashing and cryptography, network communications, OS-agnostic file management, vector and matrix operations, database access, common implementations of variable-sized data containers, or unit testing.

The language's standard library was subsequently extended to address some of these use cases, providing developers with standard ways to perform these common tasks. However, new library features tend to rely on new language features, which, as mentioned, compiler manufacturers are pretty slow to support. And even with these improvements considered, C's standard library remains pretty limited in scope, as compared to that of more modern programming languages.

Guru naming convention

Quick, what does the "f" in "fprintf" mean? If you have accurately replied "which f?", you have just raised one of the issues which I am trying to discuss here. C standard library functions are named, like UNIX commands, in an extremely abbreviated style that makes them hard to read and even harder to distinguish from one another. The problem becomes particularly serious when you realize that functions which are vulnerable to buffer overflow are named similarly to functions which aren't, to the point where it's difficult to find all occurrences of unsafe functions in a source tree without specialized tools that have in-depth knowledge of the C syntax.

By disregarding the basic way the human brain works (and, more specifically, how it processes natural language), C's naming conventions make the language difficult to learn, and programs using the language's standard library difficult to read. Like a disease, such poor naming conventions also have a tendency to spread into third-party code, meaning that if you end up reading a C program, chances are that it will be a poorly readable mess, as the developers will have attempted to stay consistent with C and UNIX's poor naming conventions.

But hey, I’m sure early C programmers typed slowly enough that the writing speedup was worth it.

Terrible memory allocation safety

Dynamic memory allocation is a surprisingly difficult problem, to the point where for really critical applications, many devs end up throwing in the towel and forgoing it altogether in favor of fixed-size, statically allocated variables and arrays. However, for many practical applications, it's important to produce programs which scale well to various workloads, a feat which statically allocated programs do not easily achieve.

Good dynamic memory allocation tools have to strike a good balance between flexibility (the program's ability to do what it needs to do, including accessing arbitrary memory addresses), performance (the unlikelihood that a car's braking subsystem will suddenly freeze and stop working for a few seconds as RAM is liberated), and safety (making it hard for developers to shoot themselves in the foot). And C comes from a time where the first was paramount, the second unspecified by the standard, and the third all but ignored.

Thus, in C, it is legal to allocate a variable on the stack in a function, then return a pointer to that variable as the function result. It is possible to free a chunk of memory that isn't needed anymore, then mistakenly access it again, without any guarantee of a program crash. It is very easy to allocate objects, then forget to initialize them. And it is totally cool to allocate an array of ten 16-bit integers, then mistakenly treat it as an array of ten 32-bit integers.
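All four sins, compressed into one sketch that any C compiler will happily accept (the dangerous operations are left commented out so the program can actually run):

#include <stdlib.h>

int *dangling_pointer() {
    int local = 42;
    return &local;                /* legal C: a pointer to dead stack memory */
}

int main() {
    short *buf   = (short *) malloc(10 * sizeof(short));
    int   *alias = (int *) buf;   /* ten 16-bit ints, reread as 32-bit ones  */
    int   *p     = dangling_pointer();

    /* All of the following compile without complaint:                 */
    /* alias[9] = 1;            writes far past the real allocation    */
    /* free(buf); buf[0] = 0;   use-after-free                         */
    /* *p = 0;                  write through a dangling stack pointer */

    free(buf);
    (void) alias;
    (void) p;
    return 0;
}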

Due to a mixture of terrible type safety, pointer handling that’s entirely devoid of safety guards, minimalistic array types, and difficult object initialization, C makes dynamic memory management particularly painful for developers. It is trivial to make memory management mistakes in C, as evidenced by the continuous security exploits and “segmentation fault” crashes that are periodically caused by these incidents in C programs.

Self-undocumenting code

One of the most absolutely certain facts about software engineering is that programs will be read and rewritten many times in their useful life. Features will be added, bugs will be fixed, exploits will be patched, and performance will be tuned. This means that the documentation describing how software works must continuously be kept in sync with these changes, which in practice turns out to be a hard software engineering problem.

This has prompted two evolutions over the history of programming languages. First, many programming languages have attempted, in various ways, to reduce the need for external documentation. They have done so by using richer type systems, higher-level abstractions, and self-explanatory naming conventions in the standard library and elsewhere. Second, languages have increasingly started to provide tools, either natively or in their reference implementation, that allow documentation to be specified directly in the source code, in a standard way.

In the C community, though, things are far from being so rosy. The language is unclear down to its basic syntax and choice of standard identifiers. It has very few high-level constructs allowing for expressive code to be written. And due to the lack of a reference implementation of C, standard documentation tools can't be provided. This leads to a proliferation of third-party code documentation tools such as Doxygen, with varying degrees of pain involved in their setup processes, making the process of writing self-documenting code in C, and maintaining it as code from various third parties is added, quite difficult.

As cross-platform as Assembly

One benefit to C which is often invoked is its cross-platform nature, theoretically allowing for code to be written once and then compiled for many different compiler/OS combinations. However, the language leaves so many things unspecified or up to the implementation that in practice, porting C code across multiple platforms is a true challenge. Worse yet, the language's limited expressivity means that the bugs introduced by porting may be particularly subtle, and only triggered in specific conditions that are unlikely to have been anticipated by developers. A simple example is bugs caused by differences in integer size across platforms, and in particular wrong developer assumptions about pointer size that aren't rejected at compile time by the language's weak type system.

There is no easy way to fix this

Fixing the language is difficult

Attempts to fix C's flaws by extending the language, as done by newer versions of the C standard and by the Objective-C language (which is a strict superset of C), will necessarily be unable to patch a large number of flaws. To stay compatible with existing code, they will not be able to patch unclear syntax (e.g. & vs &&), dangerous behavior (e.g. fallthrough switches), and mysterious naming. Moreover, precisely because they are striving for compatibility, users of C extensions will typically have to deal with heaps of legacy C code that doesn't use their improvements (as no one bothers rewriting it in the new language), documentation that doesn't fully take advantage of the new features, and developers that basically still write C89 code.

An alternative strategy is to write "close enough" clones, like C++ and D, that attempt to fix C's top shortcomings while staying compatible with the underlying language "philosophy". In practice, however, these languages face similar issues as strict C supersets: former C developers complain about how unsafe constructs are now forbidden, "fix it" by riddling their code with casts, and basically come up with a way to still write C code that uses no new features of these improved languages. They also tend to be quite conservative about what they change from C, leaving around plenty of design mistakes from their ancestor in the name of familiarity to C users.

Finally, at the other extreme, there are C subsets like C-- and MISRA C, which attempt to remove C's most problematic features and make a leaner language out of it. Though they manage to suppress some problematic behavior along the way, these languages usually fail to gain popularity outside of niche settings ("why should I use this if C does more?"), and fail to propose a real break from the core issues in C's design.

Sometimes, a clean break is necessary

Fundamentally, the failure of all these attempts to fix C's shortcomings is rooted in their strategy of claiming that they are still like C, so as to appeal to C developers. This heavily restricts the scope of changes that can be carried out. And it stifles the creativity of language designers when it comes to solving the previously mentioned software engineering problems that C's popularity has put in the spotlight.

To propose a real alternative to C, one that is neither a bloated superset nor an insufficient subset, requires a rethinking of the language's core features and peculiarities, and a user-oriented design process that starts from real-world developer experience to build features that aptly address observed problems. It requires forgetting about compatibility, admitting that building the ReactOS of programming languages simply isn't that interesting, and coming up with something that is actually new and brings actual improvements, though at the cost of some incompatibility with existing code.

There is an analogy to be made here with software refactoring: many times, cleaning up a codebase is worth the effort. It allows well-tested code to survive longer and serve new purposes. It makes for simpler, safer patches. It helps compatibility with existing data and infrastructure. However, at times, the best thing to do is to start afresh. Programming practices are just too awful. The underlying approach is just too flawed. Iteratively improving upon this mess would just take too much time.

Next week, I’ll take a wide look at other programming languages, and describe what alternatives to C are available to OS developers today.
