Up to now, the existing TOSP implementation has been managing text the DOS way. That is, it has assumed all text input to be Latin-1-encoded, has emitted Latin-1 text output, and has internally processed text in a fashion that is incompatible with Unicode strings (for example, by assuming that two characters have equivalent semantics if and only if they are encoded using the same stream of bytes). But as every computer user who speaks a language other than English knows, such an approach does not scale well into a multilingual future. ASCII and Latin-1 are not fit for the purpose of handling text input in an unknown language; they never have been, and they never will be. Since I do not plan to restrict my work to the English-speaking part of the world, I have decided that the aforementioned situation is a mistake that should be corrected by gradually switching all text processing to Unicode-aware routines. This blog post will further expose my plans in this respect.
The rationale for Unicode support in an OS kernel
Handling Unicode code points can be a great source of complexity, and it would be best if that complexity could be kept out of security-sensitive OS components such as the weakly protected lowest OS layers (kernel, bootstrap). But we humans put names on everything. We do so in a language that we speak, preferably our native one unless some external constraint forces us to do otherwise. And thus, anything that has to do with user-written text must be able to handle characters from all languages in the world, and for that purpose Unicode, as an exhaustively complete and widely recognized standard for language-agnostic text encoding, is an obvious candidate.
Could an OS kernel do without user-written text? Of course, as long as it never deals with file names or any other kind of human-readable resource identifier. But that, in turn, can be an oppressively strong constraint. It means that every system resource must be labeled using numbers when referred to in a communication with the kernel. Since humans cannot accurately remember numbers and mentally deal with them, that means in turn that every system resource managed by the kernel must bear two distinct names, a human-readable one and a numerical one, and that both names must be linked to each other through some kind of database, such as a giant table in the OS documentation, constants that are defined with a single identifier across the whole OS API, or a text file in the OS install that can potentially be corrupted by a hardware or software problem. In the end, maintaining both sets of names and the database that links them together could easily be more difficult than just using human-readable names that machines can also parse, and associating them with numerical identifiers at run time when the performance cost of dealing with strings is not acceptable.
Conversely, what does “supporting Unicode in an OS component” actually mean? It means that within the code of said component, the data structures which are dedicated to text storage, and every operation that is carried out on them, must be Unicode-aware. But to do its job, an OS kernel should actually need very few text operations. Being able to allocate strings, free them, check whether two strings are semantically identical, and perhaps also import and export a few common non-native string formats should be enough. And among those, the only operation that can actually get tricky is string comparison. Which, given a sufficiently restrictive definition of string equivalence (e.g. not necessarily equating the Greek small letter mu, U+03BC, with the micro sign, U+00B5), shouldn’t be too much of a problem either, since Unicode natively implements facilities for that through the notion of canonical equivalence.
Unicode migration plan
UTF-xx: stated roles and proposed usage
Aside from defining about a million code points, their properties, and the relationships which they have with each other, the Unicode standard specifies three ways computers can store a Unicode string: UTF-8, UTF-16 and UTF-32.
- UTF-32 simply stores a Unicode string as an array of 32-bit integers, where the value of each integer is directly equal to that of the associated Unicode code point
- UTF-16 achieves more memory-efficient text storage at the cost of some extra parsing complexity, by encoding code points as single 16-bit words in most cases and pairs of 16-bit words in the worst case
- UTF-8 stores strings as a stream of individual bytes, which is not always memory-efficient (many non-Latin scripts need three bytes per code point) and not easy to parse, but is ideal for text interchange between computers, since a plain byte stream averts endianness issues
21st-century computers feature such large amounts of RAM that memory efficiency in text manipulation is not a major issue anymore. Thus, I believe that text manipulation inside of programs should be done in the UTF-32 representation, whereas text storage and interchange should be done in the UTF-8 representation. This is also the convention used by most UNIX programs, so I’m not doing anything revolutionary there.
Unicode string storage structures and basic manipulation
Since both UTF-8 and UTF-32 are to be used, two container objects are to be created and used inside Unicode-aware system components. A “KUTF32String” structure, composed of an array of 32-bit unsigned integers and a pointer-sized unsigned “size” integer specifying the length of said array, would likely be fine for UTF-32 string storage, whereas a “KUTF8String” structure, composed of an array of unsigned bytes and a similar size counter, would do the same job for UTF-8 string storage. The minimal feature set which these objects would have to implement would be…
- Data allocation, deallocation, duplication and concatenation
- Conversion of KUTF32String to KUTF8String and vice versa for data transmission and archival
- Conversion of ASCII-encoded char null-terminated strings to KUTF32String for C compatibility
- Efficient conversion of UTF8-encoded char null-terminated strings to KUTF8String for C99/C++11 compatibility
- Efficient conversion of char32_t null-terminated strings to KUTF32String for C99/C++11 compatibility
- Access to the code point length of a KUTF32String for manual parsing
- Extraction of individual code points from a KUTF32String for manual parsing
- Efficient comparison of KUTF32Strings for equality in the sense of canonical Unicode equivalence
Notice that in C and C++, the same char type can be used to encode both ASCII and UTF-8 null-terminated strings. This is not an error on my part. According to my quick (and possibly flawed) research on Unicode support in the C family, the C99 and C++11 standards both feature this ambiguity by design. When the time comes, I am going to look into the precise standard wording, to see if the standards committees have at least taken care to mandate the presence of distinctive Unicode markers (such as a Byte Order Mark) in UTF-8 strings, which would make it possible to programmatically differentiate them from legacy ASCII strings. Otherwise, I’ll have to either make assumptions about string contents (and notify developers of them) or force developers to use explicit string conversion routines, both of which are fairly frustrating ways to work around a design oversight.
Transition of existing ASCII text handling code to Unicode strings
Once Unicode-capable data structures and routines are available, the next step will be to transition the existing ASCII text handling code to the new way of doing things. This will be done by combing through the kernel and bootstrap code, looking for anything that operates on char*, char or KString data, and then checking whether each snippet can actually handle Unicode data efficiently given a mere swap of data structures. Otherwise, care will be taken to make the code work with Unicode data if possible, and to explicitly mark it as restricted to the Latin-1 subset of Unicode otherwise. In any case, once I am done, there must be no code left which silently interprets a stream of bytes as ASCII text.
For an example of code that cannot be trivially translated to Unicode strings, consider debug text output. Since the 80×25 VGA text output which is currently in use only supports Latin-1 strings, one cannot expect Unicode strings to be properly displayed by routines that use it. So until a better means of text output can be set up (which will come after framebuffer support), these debug routines will be configured to print unsupported Unicode characters as a replacement character (such as a white rectangle), and they will be explicitly labeled as incapable of true Unicode text display in the code and library documentation.
Operating systems have to manipulate user-written text frequently, and Unicode is the most sensible way to represent such text in a language-agnostic fashion. Therefore, modern operating systems should support Unicode down to their lowest-level layers, and TOSP is no exception to this rule. I have previously made the mistake of restricting my code to Latin-1 text input and output, and I am going to correct it before it has dramatic consequences. Specific care will be taken to support the existing Unicode handling facilities of the C and C++ languages, which are used to develop this OS, and to explicitly mark as such the library functions which cannot fully support Unicode at this stage of development. And that’s it for today, so thank you for reading!