Re: unicode (char as abstract data type)

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 21 Apr 1998 13:22:17 -0700 (PDT)


On Tue, 21 Apr 1998, Steve VanDevender wrote:

> Alex, this has passed most relevance to linux-kernel long ago.
> Please take your misunderstanding of what Unicode is supposed to
> do elsewhere.

If someone is trying to make Unicode a requirement for the kernel
interface, it has everything to do with the Linux kernel.

> Unicode is a _character set_. That is, it is a set of numeric
> encodings for a set of symbols used in writing nearly all the
> languages in the world today.

It's an _inadequate_ character set. Having it as an option is ok, but
making it hard to use other charsets is completely unacceptable.

> Apparently many people are unhappy that Unicode doesn't preserve
> their home-grown 8-bit encodings for the characters used in their
> languages, but with tens of thousands of characters to represent
> that would be simply impossible.

Then you have no knowledge of the reasons why people don't like Unicode.
"Homegrown" encodings are national standards that have existed for
decades, often for very good reasons.

> As it is many Japanese and
> Chinese people are balking at Unicode because it uses the same
> values for characters that appear identical in those two
> languages but that have different meanings. That's because
> Unicode is a _character set_; it's a code for a set of symbols
> used in writing, and not for syntactic or semantic information.
> If the same symbol is used in two different languages, even if it
> has a different meaning or role, Unicode uses only one code for
> it.

This is faulty reasoning. Text has a meaning, and a lost distinction
between meanings is lost information.

> The current mish-mash of 8-bit encodings for various (mostly
> Indo-European) languages means that one must do language tagging,
> as the same character value could mean any number of different
> characters in these different encodings.

...and it doesn't seem that it will be possible to get away without
tagging anyway.

> Many non-Indo-European
> languages that use syllabic or ideographic characters could never
> be stuffed into 8-bit encodings.

An entirely wrong assumption. Languages that have more than 128
characters use multibyte encodings. Languages that have fewer use
single-byte ones. For example, modern Russian has 66 distinct glyphs, and
with all archaic variations and even other Cyrillic-based languages
added, it still fits in 8 bits.
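
To make the arithmetic concrete, here is a minimal sketch in Python. It
assumes the koi8_r codec that ships with Python, and it counts the 33
letters of the modern alphabet in both cases. The variable names are
only illustrative.

```python
# Build the modern Russian alphabet from its Unicode code points:
# U+0430..U+044F are the 32 lowercase letters, U+0451 is lowercase "yo".
lower = "".join(chr(c) for c in range(0x0430, 0x0450)) + "\u0451"
letters = lower + lower.upper()   # 33 lowercase + 33 uppercase glyphs

koi8 = letters.encode("koi8_r")   # single-byte national encoding
utf8 = letters.encode("utf-8")    # the same text as UTF-8

print(len(letters))               # number of distinct glyphs
print(len(koi8))                  # bytes needed in KOI8-R
print(len(utf8))                  # bytes needed in UTF-8
# Every letter lands in the upper half of the 8-bit range.
print(all(b >= 0x80 for b in koi8))
```

Each glyph takes exactly one byte in KOI8-R, all of them above 0x7F, so
the lower half stays plain ASCII.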

> If you use Unicode then you can
> store and display documents that use any of those languages, or
> even all of them together, without having to be continually
> switching character set interpretations.

It also means that the only thing I will be able to do with them is
display them. Even hyphenation and phonetic matching will be done
completely incorrectly.

> It also means that
> low-level applications (like the Linux kernel) can store any of
> these characters and leave the hairy aspects of language-specific
> interpretation to applications. If you want to do language
> tagging in the Linux kernel itself, just so you can continue to
> use your beloved koi-8 character set, you're introducing a huge
> amount of potential bloat that doesn't belong there.

If UTF-8 becomes a requirement, I will be *unable* to pass a koi-8 string
to a UTF-8 interface. Check the standards -- a koi8 string will be
rejected because it isn't valid UTF-8, and so it can't be converted to any
other representation of Unicode. Not every null-terminated sequence of
bytes is a valid UTF-8 string, so declaring that UTF-8 *can* be used over
a transparent interface is not the same as declaring that UTF-8 *must* be
used, or that conversion between Unicode representations will assume that
the input is in UTF-8.
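
The rejection is easy to reproduce. The following is only a sketch in
Python, assuming its built-in strict UTF-8 decoder stands in for an
interface that insists on valid UTF-8. The helper is_valid_utf8 and the
sample word are hypothetical, not part of any kernel interface.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# A Russian word ("privet"), written with escapes, stored as KOI8-R bytes.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
koi8_bytes = word.encode("koi8_r")

print(is_valid_utf8(b"hello"))             # plain ASCII
print(is_valid_utf8(word.encode("utf-8"))) # the same word as UTF-8
print(is_valid_utf8(koi8_bytes))           # the KOI8-R bytes
```

The KOI8-R letters all sit in the range 0xC0-0xFF, which UTF-8 reads as
lead bytes that must be followed by continuation bytes in 0x80-0xBF, so
a strict decoder refuses the string at the first letter.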

--
Alex
