Re: A Great Idea (tm) about reimplementing NLS.

From: Måns Rullgård
Date: Fri Jun 17 2005 - 04:18:55 EST


Patrick McFarland <pmcfarland@xxxxxxxxxxxx> writes:

> On Friday 17 June 2005 04:21 am, Måns Rullgård wrote:
>> Patrick McFarland <pmcfarland@xxxxxxxxxxxx> writes:
>> > On Thursday 16 June 2005 11:04 am, Lennart Sorensen wrote:
>> >> Most people seem happy with 50 or so being a good limit even though
>> >> many systems support much longer.
>> >
>> > 50 characters or 50 bytes? Because in the case of UTF-8, if you do a lot
>> > of three byte characters (which require four bites to encode), 50 bytes
>> > is very short.
>>
>> What do you mean by three-byte characters requiring four bytes to
>> encode? Is a three-byte character not a character encoded using three
>> bytes?
>
> (implication of utf8 and not utf16 goes here)
>
> Very few Unicode characters require three bytes, instead of the
> usual one or two.

I wouldn't the Chinese, Japanese, and Korean characters "very few",
and those all require (at least) three bytes.

> For one byte you just have the byte.

Correct.

> For two bytes, you really have three: a control code stating "the
> following two bytes are a two byte character", and then the two
> bytes.
>
> For three bytes, you really have four bytes: a control code stating
> "the following three bytes are a three byte character" and then the
> three bytes.

Wrong. The first byte indicates the total size of the character, but
it also contains data, like this:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Refer to the Unicode standard, section 3.9 for the full details.

>> As for 50 bytes being too short, many of the multibyte characters are
>> equivalent to several English characters, so fewer of them are
>> required. You have a point, though.
>
> Any English characters (ie, the first 127 ascii characters) map
> directly to the first 127 Unicode characters (if thats what you
> meant).

Let me clarify with an example. The common Korean name Kim consists
of three ascii characters. The Hangul spelling, ~, is encoded in
utf-8 using three bytes. Even though a three-byte character was used,
the number of bytes is the same.

--
Måns Rullgård
mru@xxxxxxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/