Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Alex Belits (abelits@phobos.illtel.denver.co.us)
Mon, 25 Aug 1997 23:27:17 -0700 (PDT)


On 26 Aug 1997, Kai Henningsen wrote:

> > So is BASE64, uuencode and dump in octal. That doesn't make them more
> > acceptable.
>
> That turns out not to be the case. (Actually, both HPA's and yours.)
>
> UTF-8 is easily expandable to 2^36, which is a lot more than what we might
> need in the foreseeable future, even if we happen to make contact with
> several million alien species using as many characters as we do.

..assuming that alien species use glyph-based charsets and gain nothing
from having fixed-length characters, at least within a single language.
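
For reference, the 2^36 figure can be checked with a little arithmetic.
This is only a sketch under stated assumptions: UTF-8 as currently
specified stops at six-byte sequences (31 payload bits), and the
seven-byte form with lead byte 0xFE is a hypothetical extension, not
part of any standard. The function name is made up for illustration.

```python
def utf8_payload_bits(length: int) -> int:
    """Payload bits carried by a UTF-8-style sequence of `length` bytes.

    Assumed layout: a one-byte sequence is 0xxxxxxx (7 bits). A longer
    sequence has a lead byte of `length` one-bits, a zero bit, and then
    payload, followed by continuation bytes 10xxxxxx with 6 bits each.
    Length 7 (lead byte 0xFE) is a hypothetical extension.
    """
    if length == 1:
        return 7
    if not 2 <= length <= 7:
        raise ValueError("this sketch only covers lengths 1 to 7")
    lead_bits = 7 - length  # 8 bits minus `length` ones minus one zero
    return lead_bits + 6 * (length - 1)


# Payload bits for each sequence length from 1 to 7.
bits_by_length = [utf8_payload_bits(n) for n in range(1, 8)]
print(bits_by_length)  # [7, 11, 16, 21, 26, 31, 36]
```

The price of that range is exactly the variable length complained
about here: the same charset takes anywhere from one to seven bytes
per character.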

>
> None of these is infinitely expandable. Not that it matters. They already
> allow ridiculous numbers.

In case that was not clear: the mere possibility of translating
something into some other form doesn't make it usable.

> Except, that is, that base64, uuencode, or octal don't specify any
> character set definitions (they're just ways to represent any odd binary
> data), and UTF-8 does.

..and in addition to being variable-length inside one charset, it
specifies the ugliest charset possible.

> > There is nothing nationalistic in distinguishing between similar-looking
> > characters that belong to different languages, have different meaning
> > and usage and may be written/typeset differently. No one proposed to
> > make Cyrillic "А" and Roman "A" the same character, even though when I
> > write this, both of them look exactly the same -- why should others do
> > that?
>
> Very simple reason - the round trip principle. On designing Unicode, the
> rule was that for every ISO standard existing at that time (and some
> vendor standards, too),
> 1. every character in that standard should go into Unicode
> 2. It should be possible, without losing any information, to translate
> text in that standard into Unicode and back again
> 3. Otherwise, characters should be unified

That loses the information about the original language that is preserved
in all charset-labeling systems, and gives nothing in exchange (except
compatibility with *your* iso8859-1 and lack of compatibility with
everything else).
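
To make the round-trip rule quoted above concrete, here is a minimal
sketch using Python's standard codecs. The helper name round_trips is
made up for illustration, and the choice of KOI8-R and ISO 8859-1 as
test charsets is mine. It shows rule 2 holding for both charsets, and
that Cyrillic "А" and Latin "A" stay separate code points. It does not
show, and cannot show, which language a unified character came from.

```python
import unicodedata


def round_trips(data: bytes, charset: str) -> bool:
    """True if decoding to Unicode and encoding back reproduces `data`."""
    return data.decode(charset).encode(charset) == data


all_bytes = bytes(range(256))

# Rule 2: legacy text -> Unicode -> legacy text loses nothing.
koi8_ok = round_trips(all_bytes, "koi8_r")
latin1_ok = round_trips(all_bytes, "latin-1")

# The same byte 0xE1 means different characters under different labels.
cyrillic_a = b"\xe1".decode("koi8_r")   # Cyrillic capital A
latin_a_acute = b"\xe1".decode("latin-1")  # Latin small a with acute

print(koi8_ok, latin1_ok)
print(unicodedata.name(cyrillic_a))
print(unicodedata.name("A"))
print(cyrillic_a == "A")
```

The byte 0xE1 also explains how a Cyrillic "А" sent as KOI8-R shows up
as "á" when the message is read as iso8859-1.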

> Anyway, the Han Unification was done by the East Asians themselves.

Just like Holocaust by Jews.

> As to typesetting differences, that's not what Unicode is about.

It creates trouble for the processing that typesetting relies on (which
charset labeling does not).

> As to
> languages, yes, there _is_ unification of characters from different
> languages even in the latin part of Unicode.

And, _please_, don't unify my language with yours until I've asked you for
that.

--
Alex