Re: A Great Idea (tm) about reimplementing NLS.

From: Lennart Sorensen
Date: Fri Jun 17 2005 - 08:11:20 EST

On Fri, Jun 17, 2005 at 04:49:33AM -0400, Patrick McFarland wrote:
> (implication of utf8 and not utf16 goes here)
> Very few Unicode characters require three bytes, instead of the usual one or
> two.
> For one byte you just have the byte.
> For two bytes, you really have three: a control code stating "the following
> two bytes are a two byte character", and then the two bytes.
> For three bytes, you really have four bytes: a control code stating "the
> following three bytes are a three byte character" and then the three bytes.
> Unless I've completely misunderstood the Unicode specification, this is what
> is going on.

You have probably slightly misunderstood UTF-8 at least. UTF-8 tries very
hard to make sure you can't mistake the bytes of a multibyte character for
ASCII, so the first byte of such a character contains some 1s followed by
one zero. The number of 1s indicates how many bytes the character
occupies; after the 0, the remaining bits store bits of the character
code. The remaining bytes are all 10xxxxxx, each of which stores another
6 bits of the character code. One is required to use the shortest form of
UTF-8 that can store the character you are encoding.

x's are where the bits for the character number go:
0xxxxxxx encodes characters 0-127
110xxxxx 10xxxxxx encodes characters 128-2047
1110xxxx 10xxxxxx 10xxxxxx encodes characters 2048-65535
etc up to
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx encodes characters

As far as I know, Unicode doesn't currently define anything past 21 bits
(U+10FFFF), so 4 bytes is probably the most you will see in normal use,
with 3 bytes covering quite a large number of the characters.
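
To make the bit layout above concrete, here is a rough sketch of an
encoder in C. It is only an illustration, not code from any existing
tree: the name utf8_encode is made up, it stops at the 4-byte form on the
assumption that nothing above U+10FFFF needs encoding, and it does no
validation beyond that range check.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point cp as UTF-8 into out, which must have room for
 * 4 bytes. Returns the number of bytes written, or 0 if cp is above
 * 0x10FFFF. utf8_encode is a hypothetical name used for illustration.
 */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        /* 0xxxxxxx: plain ASCII, stored as-is */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx: 5 + 6 = 11 bits */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 bits */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 bits */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* outside the range assumed above */
}
```

Because each branch is tried in order of increasing length, the encoder
always picks the shortest form that fits. For example, U+20AC comes out
as the three bytes E2 82 AC.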

> Any English characters (ie, the first 127 ascii characters) map directly to
> the first 127 Unicode characters (if thats what you meant).

Well, UTF-8 is also backwards compatible with ASCII, to make handling text
files simpler. You could encode the ASCII characters using the multibyte
forms of UTF-8, except that would violate the rule of using the shortest
form possible.
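
As a sketch of how a decoder might enforce that shortest-form rule, the
C below rejects any sequence whose decoded value would have fit in fewer
bytes. Again this is just an illustration under assumptions: utf8_decode
and the min_cp table are made-up names, and only forms of up to 4 bytes
are accepted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (len bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, or not in shortest form.
 * utf8_decode is a hypothetical name used for illustration.
 */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes, indexed by n. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                   /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {  /* 110xxxxx */
        n = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {  /* 1110xxxx */
        n = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {  /* 11110xxx */
        n = 4;
        c = s[0] & 0x07;
    } else {
        return 0; /* stray 10xxxxxx byte, or a 5/6-byte lead byte */
    }

    if (len < n)
        return 0;

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)       /* must be 10xxxxxx */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Shortest-form check: the value must actually need n bytes. */
    if (c < min_cp[n] || c > 0x10FFFF)
        return 0;

    *cp = c;
    return n;
}
```

With this, the two-byte sequence C1 81 (which would otherwise decode to
'A', 0x41) is rejected, while the single byte 41 is accepted.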

Len Sorensen