2013-09-29

UTF-32 is NOT a fixed-length character encoding

Recently, people on Hacker News have been discussing Unicode and its many encoding schemes.

UTF-8 – "The most elegant hack" | Hacker News
UTF-8 Original Proposal | Hacker News

In the comments, many people believe UTF-32 is a fixed-length character encoding. This is not correct. UTF-32 is a fixed-length code point encoding. Character != Code point. Period.

Actually, I'm not good at Unicode or English, as you can see. But I think it is my duty to enlighten those blind people who still think of characters in ASCII terms.

What is Unicode?

Unicode defines a set of code points that represent glyphs, symbols, and control codes. It defines a mapping between real glyphs and numerical values called code points. In Unicode, a single code point does not necessarily represent a single character.

For example, Unicode has combining characters. There is more than one way to express the same character. ñ can be expressed in Unicode code points either as U+00F1 (LATIN SMALL LETTER N WITH TILDE), or as U+006E (LATIN SMALL LETTER N) followed by U+0303 (COMBINING TILDE). In the latter case, a sequence of Unicode code points semantically represents a single character. Japanese has such characters too. Thus, in Unicode, Character != Code point.
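
As a minimal sketch in Python, here are the two spellings of ñ. Both display identically, but they are different code point sequences until you normalize them:

    import unicodedata

    precomposed = "\u00F1"         # U+00F1: LATIN SMALL LETTER N WITH TILDE
    combining   = "\u006E\u0303"   # U+006E + U+0303: n followed by COMBINING TILDE

    print(precomposed, combining)             # both display as ñ
    print(len(precomposed), len(combining))   # 1 vs 2 code points
    print(precomposed == combining)           # False: different code point sequences
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True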

Another example is a feature called variation selectors, or IVS (Ideographic Variation Sequence). This feature is used to represent minor glyph shape differences for semantically the same glyph. CJK kanzis are the typical example of this. It consists of a sequence of code points, beginning with the ordinary code point for the glyph, followed by one of U+FE00 to U+FE0F or U+E0100 to U+E01EF. If followed by U+E0100, it's the first variant; U+E0101 is the second variant, and so on. This is another case where a sequence of code points represents a single character. According to Wikipedia, U+180B to U+180D are additionally assigned specifically for Mongolian glyphs, which I don't know much about.
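
As a sketch, here is a base kanzi with a variation selector appended. I picked 葛 (U+845B) because it is commonly cited as having registered glyph variants; whether the variant actually renders differently depends on your font:

    base    = "\u845B"               # 葛, the ordinary code point for the glyph
    variant = base + "\U000E0100"    # same glyph followed by VARIATION SELECTOR-17

    print(len(base), len(variant))   # 1 vs 2 code points, one character either way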

Now we know that Unicode is not a fixed-length character mapping. Let's look at the multiple encoding schemes for Unicode. Unicode is a standard that maps characters to code points; it is not itself an encoding scheme. Encodings of Unicode code points are defined in multiple ways.

UTF-16

UTF-16 was the first encoding scheme for Unicode code points. Originally, it just encoded each Unicode code point as a single 16-bit integer. A pretty straightforward encoding.

Unicode was initially considered to be a 16-bit fixed-length character encoding. "16 bits ought to be enough for all characters": a stupid assumption by some idiotic western caucasians, so-called professionals who had no real knowledge of real-world glyph history, I presume. Anyway, this assumption was broken single-handedly by Japanese, since I am fairly certain that Japanese has more than 65536 characters. So do Chinese and Taiwanese (although we mostly share the same kanzis, so many differences have evolved over time that I think they can be considered totally different alphabets by now), and Korean (I've heard their hangeul alphabet system has tens of thousands of theoretical combinations). And of course many researchers want to include the characters of now-dead languages. Plus, the Japanese cell phone industry independently invented tons of emozi. It never ends.

So, now that Unicode has more than 2^16 code points, a single code point cannot always be encoded in 16 bits anymore.

UTF-16 deals with this problem by using a variable-length coding technique called surrogate pairs. With a surrogate pair, a sequence of two 16-bit UTF-16 units represents a single code point, thereby breaking the assumption that 1 unit = 1 code point. Combined with Unicode's combining characters and variation selectors, UTF-16 cannot be considered a fixed-length encoding in any way.
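
To make the mechanism concrete, here is a minimal sketch of the surrogate pair arithmetic in Python (the constants 0x10000, 0xD800, and 0xDC00 come from the UTF-16 specification):

    def to_surrogate_pair(cp):
        # Split a supplementary code point (U+10000 and above) into
        # a high (lead) surrogate and a low (trail) surrogate.
        assert cp >= 0x10000
        v = cp - 0x10000
        high = 0xD800 + (v >> 10)     # top 10 bits
        low  = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
        return high, low

    # U+1F600, an emozi outside the BMP, takes two UTF-16 units.
    print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']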

But there is one good thing about UTF-16. In Unicode, most of the essential glyphs we use daily are squeezed into the BMP (Basic Multilingual Plane). A BMP code point fits in 16 bits, so it can be encoded in a single 16-bit UTF-16 unit. For Japanese at least, most common characters are in this plane, so most Japanese text can be encoded efficiently in UTF-16.

UTF-32

UTF-32 encodes each Unicode code point as a 32-bit integer. It doesn't have surrogate pairs like UTF-16. So you can say that UTF-32 is a fixed-length code point encoding scheme.

But as we have learned, code point != character in Unicode. Unicode is a variable-length mapping of real-world characters to code points. So UTF-32 is also a variable-length character encoding.

But it's easier to handle than UTF-16, because each single UTF-32 unit is guaranteed to represent a single Unicode code point. It is a bit space-inefficient, though, because every code point must be encoded in a 32-bit unit, where UTF-16 allows a 16-bit encoding for BMP code points.
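
A small sketch of the distinction: in UTF-32 the unit count always equals the code point count, but it still does not equal the character count:

    s = "n\u0303"                  # ñ spelled as base letter + combining tilde
    utf32 = s.encode("utf-32-le")  # a fixed 4 bytes per code point

    print(len(utf32) // 4)         # 2 UTF-32 units...
    print(s)                       # ...for what displays as a single character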

UTF-8

UTF-8 is a clever hack by... who else? THE fucking Ken Thompson. If you've never heard the name Ken Goddamn Thompson, you are an idiot living in a shack somewhere in the mountains, and you probably cannot understand the rest of this article, so stop reading now. HE IS JUST THAT FAMOUS. Not knowing his name is a real shame in this world.

UTF-8 encodes each Unicode code point as a sequence of one to four 8-bit units. It is a variable-length encoding and, most importantly, it preserves all existing ASCII code as-is. So most existing code that expects ASCII and doesn't do anything too clever just accepts UTF-8 as if it were ASCII, and it just works! This is really important. Nothing is more important than backward compatibility in this world. Existing working code is worth a million times more than the theoretically better alternatives somebody comes up with today.
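
A quick sketch of both properties, the one-to-four-unit lengths and the ASCII compatibility:

    for ch in ("A", "\u00F1", "\u3042", "\U0001F600"):  # A, ñ, あ, an emozi
        b = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(b)} byte(s): {b.hex()}")

    # ASCII characters stay single bytes, so any pure ASCII file is
    # already valid UTF-8, byte for byte.
    print("hello".encode("ascii") == "hello".encode("utf-8"))  # True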

And since UTF-16 and UTF-32 are, by definition, variable-length encodings, there is no point in preferring them over UTF-8 anyway. Sure, UTF-16 is space-efficient when it comes to the BMP (UTF-8 can require 24 bits even for a BMP code point), and UTF-32's fixed-length code point encoding might come in handy for some quick-and-dirty string manipulation. But you have to deal with variable-length coding eventually anyway. So UTF-8 doesn't have many disadvantages compared to the previous two encodings.

And UTF-16 and UTF-32 have an endianness issue.

Endianness

There is a matter of taste, or an implementation design choice, in how the underlying architecture represents the bytes of data. By "byte", I mean 8 bits. I don't consider non-8-bit-byte architectures here.

Even though modern computer architectures have 32-bit or 64-bit general-purpose registers, the most fundamental unit of processing is still the byte: an array of 8-bit units of data. How an architecture represents integers wider than 8 bits is really interesting.

Suppose we want to represent the 16-bit integer value 0xFF00 in hex, or 1111111100000000 in binary. The most straightforward approach is to just adopt the usual left-to-right writing order as higher-to-lower. So the 16 bits of memory are filled as 1111111100000000. This is called big endian.

But there is another approach. Let's treat it as 8-bit units of data, the higher 8 bits 11111111 and the lower 8 bits 00000000, and lay them out lower-to-higher. So the physical 16 bits of memory are filled as 0000000011111111. This is called little endian.
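
A minimal sketch of the two byte orders, serializing the same 16-bit value both ways with Python's struct module:

    import struct

    value = 0xFF00
    print(struct.pack(">H", value).hex())  # big endian:    ff00
    print(struct.pack("<H", value).hex())  # little endian: 00ff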

As it happens, the most popular architecture for desktops and servers is x86 (by now, its 64-bit enhancement x86-64, or AMD64). This particular architecture chose little endian. It cannot be changed anymore. As I said, backward compatibility is more important than human readability or minor confusion. So we have to deal with it.

UTF-16 and UTF-32 each require 16-bit or 32-bit integers, and the internal representation of an integer may be big endian or little endian. This is a real pain if you store text on disk or send it over the network.
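
For example, here is the same BMP character serialized as UTF-16 in both byte orders:

    s = "\u3042"                        # あ, a BMP code point
    print(s.encode("utf-16-be").hex())  # 3042
    print(s.encode("utf-16-le").hex())  # 4230: same text, different bytes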

UTF-8 doesn't take any shit from this situation, because its unit length is 8 bits. That is a byte. Byte representation is historically consistent among many architectures (again ignoring the fact that there were weird non-8-bit-byte architectures).

A minor annoyance with UTF-8, as a Japanese

Although UTF-8 is the best practical Unicode encoding scheme and the least bad option for character encoding, as a Japanese, I have a minor annoyance with UTF-8. That is its space inefficiency, or more precisely, its very variable-length coding nature.

In UTF-8, most Japanese characters each require 24 bits, or three UTF-8 units. I don't complain about the fact that this is 1.5 times less efficient than UTF-16 for the BMP, so that my Japanese text files are 50% bigger. The problem is that in some contexts, string length is counted by the number of units, and the maximum number of units is very tight. Like in file systems.

Most file systems reserve a fixed number of bits for file names. The Linux kernel's default file system, ext4, for example, reserves 255 bytes (1 byte = 8 bits) for a file name. So the length limit on a file name is counted not in characters, but in bytes. Most GNU/Linux based distros now use UTF-8 as the default character encoding, so file names on ext4 are effectively UTF-8 as well. For people who still think in ASCII (the typical native English speaker), 255 bytes is enough for a file name most of the time, because UTF-8 is ASCII compatible and any ASCII character can be represented in one byte. So for them, 255 bytes equals 255 characters most of the time.

But for us, the Japanese, each Japanese character requires 3 bytes of data, because that's how UTF-8 encodes it. This effectively divides the maximum character count by three, down to somewhere around 80 characters. And that is a rather strict limitation.
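
The arithmetic, as a quick sketch:

    print(len("\u3042".encode("utf-8")))  # 3 bytes for a typical Japanese character
    print(255 // 3)                       # 85: the character budget of a 255-byte name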

If UTF-8 were the only character encoding used in file systems, we could live with that (although it is a bit annoying). But there are file systems which use different character encodings, notably NTFS.

NTFS is Microsoft's proprietary file system, whose format is not disclosed and which is encumbered by a lot of crappy patents (how something that can be expressed as a pure array of bits, with no interaction with the laws of physics, can be patented is beyond my understanding), so you must avoid using it. The point is, NTFS encodes a file name in up to 255 UTF-16 units. This greatly loosens the limit on the maximum character length of a file name, because most Japanese characters fit in the BMP and so can each be represented by a single UTF-16 unit. On NTFS, even the Japanese can practically assume 255 units = 255 characters most of the time.

Sometimes we have to deal with files created by NTFS users, especially in archive files such as zip. If an NTFS user takes advantage of the looser file name limit and names a file with 100 Japanese characters, that full file name cannot be used on other file systems, because 100 Japanese characters require 300 UTF-8 units most of the time, which exceeds the typical file system limit (255 bytes).
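
A sketch of exactly this mismatch, using a hypothetical 100-character Japanese file name:

    name = "\u3042" * 100  # a 100-character Japanese file name

    utf16_units = len(name.encode("utf-16-le")) // 2
    utf8_bytes  = len(name.encode("utf-8"))

    print(utf16_units)  # 100 units: well under NTFS's 255-unit limit
    print(utf8_bytes)   # 300 bytes: over ext4's 255-byte limit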

But this is more a problem of file system design than of UTF-8. We have to live with it.

Unicode equivalence - Wikipedia
異体字セレクタ (variation selectors) - Wikipedia (Japanese; currently, there is no English Wikipedia entry for this)

2 comments:

Anonymous said...

"a stupid assumption by some idiotic western caucasians" made me laugh.
There are various things I'd like to nitpick, but just one point:
it's "enlight", not "enlighten".

Unknown said...

Great article, thank you for this, Sir!