Re: A Great Idea (tm) about reimplementing NLS.

From: Bernd Eckenfels
Date: Fri Jun 17 2005 - 04:42:22 EST

In article <200506170450.12943.pmcfarland@xxxxxxxxxxxx> you wrote:
> (implication of utf8 and not utf16 goes here)
> Very few Unicode characters require three bytes, instead of the usual one or
> two.

Two-byte UTF-8 encodings end at U+07FF, which covers only Latin, Cyrillic,
Hebrew and a handful of other alphabetic scripts.

All CJK Unified Ideographs (U+4E00-) and Extension A (U+3400-) have 3-byte
encodings in UTF-8. Extension B (U+20000-) even needs 4 bytes.
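
To make the ranges concrete, here is a rough sketch in Python. The function
name utf8_length and the sample table are made up for illustration; only the
range boundaries come from the UTF-8 definition.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # ASCII
        return 1
    if code_point <= 0x7FF:     # Latin supplements, Cyrillic, Hebrew, ...
        return 2
    if code_point <= 0xFFFF:    # rest of the BMP, including U+3400- and U+4E00-
        return 3
    return 4                    # supplementary planes, e.g. U+20000-


# Hypothetical sample points, chosen at the boundaries discussed above.
samples = [0x41, 0x7FF, 0x800, 0x3400, 0x4E00, 0x20000]
for cp in samples:
    # Cross-check against Python's own encoder.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

So anything from U+0800 upward, which includes practically all CJK text,
is already at three bytes or more.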

> For one byte you just have the byte.

For ASCII you have one byte.

> For two bytes, you really have three: a control code stating "the following
> two bytes are a two byte character", and then the two bytes.

Umm, that's a bit misleading. UTF-8 works with bit prefixes, not byte
prefixes. Unicode code points are integers and, depending on the encoding,
are represented as one or more code units, which in UTF-8 are single bytes.
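
A minimal sketch of what "bit prefixes" means, again in Python. The name
utf8_encode is only for illustration, and this is a hand-rolled version for
explanation, not how any particular library does it. The leading bits of the
first byte give the sequence length, and every following byte starts with
the bits 10:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand to show the UTF-8 bit layout."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x4E00, 0x20000):
    encoded = utf8_encode(cp)
    # The hand-rolled result should match Python's built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

A two-byte character is therefore really two bytes: the length marker lives
in the top bits of the first byte, there is no separate control byte in
front of it.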

> Unless I've completely misunderstood the Unicode specification, this is what
> is going on.

You might want to look up Joel's Tutorial or just browse the Unihan Database:
