Even the minimal Unicode support that I want in the kernel requires string handling code to have access to a small character property database, which is used to determine whether two Unicode strings are canonically equivalent. So the question I’ve had to ask myself recently was: how can I efficiently store and use such a character database in an OS kernel? As it turned out, properly answering that question was a bit trickier than I expected, so here’s a tale of how it went…
A word about Unicode string comparison
To determine whether two Unicode strings are strictly equivalent, two character properties are useful: canonical decomposability, which states whether a Unicode character can be decomposed into an equivalent sequence of code points, and canonical combining classes, which state in which circumstances code points in a string can be swapped without altering text semantics.
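To make the “swappable” idea concrete, here is a minimal sketch of how combining classes can be used to put marks into a canonical order before comparing. It is written as hosted C++ for readability (a kernel would not have `std::vector` available as-is), the function names are made up for illustration, and the four-entry class table is only a tiny sample of what the real database contains. Decomposition is left out entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tiny illustrative subset of canonical combining classes. A real table would
// come from the extracted database. Unlisted code points get class 0.
std::uint8_t combining_class(char32_t cp) {
    switch (cp) {
        case 0x0301: return 230; // COMBINING ACUTE ACCENT
        case 0x0308: return 230; // COMBINING DIAERESIS
        case 0x0323: return 220; // COMBINING DOT BELOW
        case 0x0327: return 202; // COMBINING CEDILLA
        default:     return 0;   // not swappable
    }
}

// Sort each run of nonzero-class code points by ascending class.
// The sort is stable, so marks of equal class keep their relative order,
// and class-0 code points never move.
void canonical_reorder(std::vector<char32_t>& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (combining_class(s[i]) == 0) { ++i; continue; }
        std::size_t j = i;
        while (j < s.size() && combining_class(s[j]) != 0) ++j;
        std::stable_sort(s.begin() + i, s.begin() + j,
            [](char32_t a, char32_t b) {
                return combining_class(a) < combining_class(b);
            });
        i = j;
    }
}

// Strings that differ only in the order of swappable marks compare equal
// once both have been reordered. Arguments are taken by value on purpose.
bool equal_after_reordering(std::vector<char32_t> a, std::vector<char32_t> b) {
    canonical_reorder(a);
    canonical_reorder(b);
    return a == b;
}
```

With this sketch, “a + acute + dot below” and “a + dot below + acute” compare equal because the two marks have different classes, whereas “a + acute + diaeresis” and “a + diaeresis + acute” do not, since both marks share class 230 and therefore may not be swapped.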
Thankfully, the Unicode standardization committee has done a fairly good job of keeping the number of canonically equivalent code point sequences minimal. In version 6.2.0 of the Unicode standard, we’re only talking about 2053 canonical character decompositions and 653 “swappable” code points with nonzero combining classes, which is a small fraction of the hundreds of thousands of characters that the Unicode Character Database contains.
The question is, however: how, and in which form, should I transmit a list of these “special” characters to the kernel?
Extracting relevant character data
The Unicode Character Database is a large digital compendium of Unicode character properties, spread across multiple files which together weigh around 16.4 MB in uncompressed form. And that’s without including extra information about the Han ideographs used by Chinese and Japanese, or some “extracted” properties which are nontrivial to access directly.
Obviously, it would be overkill to include and parse all that stuff in the kernel when all I want is to be able to compare strings with one another. Therefore, I have extracted the subset of this database that is relevant for this purpose into two simplified text files, one which associates decomposable code points with their decompositions, and one which associates “swappable” code points with their canonical combining class.
This time around, the extraction process has been manual, but I want to keep it easy to automate, so that support for future versions of the Unicode standard can be added painlessly across all supported architectures. A noticeable consequence of this is that it would be a bad idea to directly extract a binary version of the database to be statically linked into the kernel: with future automation in mind, this would mean having to support each architecture’s binary data storage formats in a cross-compiling form, which would essentially be on a similar level of awfulness as maintaining a basic C compiler. No, thanks.
Giving the kernel access to the extracted data
To avoid this issue, the best option would probably be to transmit the extracted database to the kernel in textual form, and then have the kernel perform the text-to-binary conversion by itself at boot time. On a modern machine, the overhead of loading and parsing 32 KB of data at boot time should be negligible, and by doing things this way, we make use of the fact that we already have a cross-compiling C/C++ compiler at hand, which knows very well how various architectures handle binary data.
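As a rough idea of what such a boot-time conversion could look like, here is a sketch that parses a combining class file into a sorted in-memory table and looks entries up by binary search. The line format (`<hex code point>;<decimal class>`, one entry per line, sorted by code point) is an assumption made for this example, not necessarily the format of the files described above, and the names `CccEntry`, `parse_ccc_table` and `lookup_ccc` are hypothetical. The code avoids the C library's parsing functions so that it could run in a freestanding environment.

```cpp
#include <cstddef>
#include <cstdint>

// Binary form of one entry. Its layout is decided by the cross-compiler.
struct CccEntry {
    std::uint32_t code_point;
    std::uint8_t  combining_class;
};

// Parses lines of the assumed form "<hex code point>;<decimal class>\n" into
// `out`. Returns the number of entries written, and stops at the first
// malformed line or once `capacity` entries have been stored.
std::size_t parse_ccc_table(const char* text, std::size_t length,
                            CccEntry* out, std::size_t capacity) {
    std::size_t pos = 0, count = 0;
    while (pos < length && count < capacity) {
        // Hexadecimal code point (at most 6 digits).
        std::uint32_t cp = 0;
        std::size_t digits = 0;
        for (; pos < length; ++pos, ++digits) {
            const char c = text[pos];
            std::uint32_t v;
            if (c >= '0' && c <= '9')      v = static_cast<std::uint32_t>(c - '0');
            else if (c >= 'A' && c <= 'F') v = static_cast<std::uint32_t>(c - 'A' + 10);
            else if (c >= 'a' && c <= 'f') v = static_cast<std::uint32_t>(c - 'a' + 10);
            else break;
            cp = cp * 16 + v;
        }
        if (digits == 0 || digits > 6 || pos >= length || text[pos] != ';') break;
        ++pos; // skip ';'

        // Decimal combining class (at most 3 digits, at most 255).
        std::uint32_t ccc = 0;
        digits = 0;
        for (; pos < length && text[pos] >= '0' && text[pos] <= '9'; ++pos, ++digits)
            ccc = ccc * 10 + static_cast<std::uint32_t>(text[pos] - '0');
        if (digits == 0 || digits > 3 || ccc > 255) break;

        out[count++] = { cp, static_cast<std::uint8_t>(ccc) };

        // Accept a newline or end of input after an entry, nothing else.
        if (pos < length) {
            if (text[pos] != '\n') break;
            ++pos;
        }
    }
    return count;
}

// Binary search in a table sorted by ascending code point.
// Code points that are not in the table have combining class 0.
std::uint8_t lookup_ccc(const CccEntry* table, std::size_t count,
                        std::uint32_t cp) {
    std::size_t lo = 0, hi = count;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (table[mid].code_point < cp) lo = mid + 1;
        else hi = mid;
    }
    return (lo < count && table[lo].code_point == cp)
               ? table[lo].combining_class : 0;
}
```

With only a few hundred swappable code points, a sorted array plus binary search keeps each lookup at around ten comparisons without needing any dynamic data structure, which seems like a reasonable starting point before trying anything fancier.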
There are multiple ways to transmit the text database to the kernel, however. One way is to statically link it into the kernel, just like one would for a binary database. The main advantage of this method is that it is simple to implement. Its main issue is that it makes it difficult to reclaim the memory used by the text database once a binary version has been generated, since as far as kernel loading and memory management are concerned, it’s just a part of the kernel’s data segment.
To make more efficient use of RAM, a better option would probably be to have GRUB load the files as a kernel module, and this is what my current prototype code does. One question remains, though: what is the best way to give kernel code access to modules? Should I just make the list of kernel modules a global variable and be done with it, or is there a more elegant, yet not cumbersome, way to go about it? I have no answer yet on that matter, so if someone has got an idea…
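For the sake of discussion, one middle ground between a bare global variable and something heavier is a small read-only wrapper that the boot code fills in once and then hands to whichever code needs it. The sketch below is purely illustrative: `BootModule`, `ModuleList` and the file name are hypothetical, and the descriptor fields are a simplification rather than the exact structure that GRUB passes to the kernel.

```cpp
#include <cstddef>
#include <cstring>

// Simplified, hypothetical description of one module loaded by the bootloader.
struct BootModule {
    const char* name;   // identifying string given on the bootloader's module line
    const void* start;  // address of the first byte of the module
    std::size_t size;   // module length in bytes
};

// Read-only view over the module descriptors. Client code can look modules up
// by name but cannot modify the list.
class ModuleList {
public:
    ModuleList(const BootModule* modules, std::size_t count)
        : modules_(modules), count_(count) {}

    // Returns the module with the given name, or nullptr if there is none.
    const BootModule* find(const char* name) const {
        for (std::size_t i = 0; i < count_; ++i) {
            if (std::strcmp(modules_[i].name, name) == 0) return &modules_[i];
        }
        return nullptr;
    }

    std::size_t count() const { return count_; }

private:
    const BootModule* modules_;
    std::size_t count_;
};
```

The Unicode initialization code would then only need something like `modules.find("unicode_ccc.txt")` and would never see the rest of the boot information, which also leaves a single place to track which modules are done being used and can have their memory reclaimed.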
Update: So, after thinking about this a bit, I think the most elegant way to go about this is probably to create a new kernel component dedicated to the management of modules, called ModuleManager. I’m drafting a design of that right now, and will further describe that in a new post once it’s done.