More details on Unicode string comparison issues

In my last post about the upcoming switch of internal kernel strings to Unicode, I mentioned that the only string manipulation operation which I expected to involve significant design work would be the string comparison operations “==” and “!=”. Now that I think that I have reached a sufficiently advanced point in my study of the Unicode standard, I would like to discuss this issue in a slightly more extended fashion.

Canonical equivalence

In a nutshell, the Unicode standard has mostly been devised by merging the huge pile of language-specific text encodings that existed before it, and then adding some extras. This process was carried out in a manner that attempted to avoid unnecessary duplication, while still offering lossless text convertibility from existing encodings. But as a result of this, some redundant and incompatible designs, such as encoding accented Latin characters like “à” either as a single code point or as a (letter, diacritic) pair of code points, were both kept in the Unicode standard. This, along with other Unicode design decisions, effectively provides multiple ways for Unicode-compliant software to encode some characters.

It is obvious that string comparison operations should not be fooled by such redundant encoding, and to help at this job, the Unicode standard provides so-called “canonical code point decompositions”. It is a property of all Unicode code points, which maps them into a smaller, non-redundant part of the Unicode character set, without discarding any semantic information. By enforcing such a decomposition on all input strings (which is called “Normalization Form D” or “NFD” in Unicode parlance), one can ensure that a given sequence of characters will always be encoded using the same stream of integers, which in turn allows string comparisons to be carried out using simple binary operations. And this is pretty much the way I intend to handle things.

Compatibility equivalence

But sometimes, the semantical distinction between two canonically distinct Unicode characters can also be very weak. As an example, there are two separate Unicode code points for the “mu” Greek letter and the “micro” mathematical symbol, whereas both are rendered using the very same glyph (“µ”) in most fonts. As another example, one can also consider half-width katakana characters in Japanese, which, as the name suggests, are just an alternate representation of a given character that is to be rendered using a thinner glyph. However, when such things happens, it is generally the result of an attempt to allow lossless conversion of legacy text to Unicode text. And wherever the existence of such similar characters goes against the philosophy of the Unicode standard in some way, the standard does provide a lossy way to convert such deprecated characters to their “idiomatic” Unicode equivalents, through the mechanism of “compatibility code point decomposition”.

This raises another question though: should string operations ignore such compatibility distinctions between characters ? Though call. On one hand, the distinctions between compatibility characters and their idiomatic equivalents can be so subtle that discriminating them may lead to weird bugs or security breaches through identifier spoofing. On the other hand, I would be a bit wary of ignoring semantical distinctions between characters wherever they exist, since they can sometimes be about as important as that between lower-case and upper-case characters in Latin scripts. And since the Unicode standard explicitly forbids silently converting strings to their compatibility equivalent representations through conformance requirement C7, an efficient string implementation would effectively have to simultaneously maintain two copy of every Unicode string in memory, one “original version” and one “compatibility version”, which would be at the same time wasteful and complex to implement.

For this reason, I suggest that in general, low-level OS string handling routines, which deal with fairly well controlled input should stick with NFD (“canonical”) string normalization. Compatibility decomposition, on its side, would be reserved to higher-level system processes which deal with uncontrolled output and can afford lossy text handling. As a rule of thumb, I specifically recommend that any user-facing operation that ignores case distinctions between strings should ignore compatibility distinctions as well.

And that will be it for today, I guess !

One thought on “More details on Unicode string comparison issues

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s