Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Darin Johnson (darin@connectnet.com)
Tue, 19 Aug 1997 12:54:57 -0700 (PDT)


> From: Alex Belits <abelits@phobos.illtel.denver.co.us>

> AFAIK, Chinese, Japanese and Russians _oppose_ Unicode that is mostly
> pushed by people who use iso8859-1 anyway, and thus have trivial mapping
> between their native charset and Unicode.

Yes, I've found this to be true, at any rate. They've already got
encoding systems and existing sets of tools that work just fine, and
converting all the legacy apps is nearly impossible. If you've already
got code that handles multibyte characters, Unicode would just be
another encoding to complicate things (i.e., why would they use a
Unicode locale instead of an SJIS locale?).

(Although UTF-8 is a nicer encoding to deal with if you don't have a
huge library of internationalized string routines; e.g., you can call
strtok() with 7-bit delimiters and it'll work just fine, as in the
sketch below, whereas SJIS might split a token in the middle of a
multibyte character.)
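
For what it's worth, here's a rough sketch of the kind of thing that
works with UTF-8 but can break with SJIS (the string and the '|'
delimiter are just made up for illustration):

    #include <stdio.h>
    #include <string.h>

    /* Tokenize a UTF-8 string on a 7-bit delimiter.  Every byte of a
     * UTF-8 multibyte sequence has the high bit set, so strtok() can
     * never mistake the inside of a character for '|'.  In SJIS the
     * second byte of a double-byte character can fall in the ASCII
     * range (0x40-0x7E), so the same call could cut a character in
     * half. */
    int main(void)
    {
        /* "red|<Japanese word>|blue", with the Japanese in UTF-8 */
        char line[] = "red|\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e|blue";
        char *tok;

        for (tok = strtok(line, "|"); tok != NULL; tok = strtok(NULL, "|"))
            printf("token: %s\n", tok);
        return 0;
    }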

A big advantage of Unicode, though, is that it handles more than one
language at a time. Very few programs deal with that situation; most
assume everyone is monolingual, or that if multiple languages are
required then Latin-1 suffices. By "multiple languages" I mean more
than just ASCII-based English plus one other language. Compare Linux,
where I can use Mule to edit multiple languages, with Windows NT,
where I have to install a new OS for each language (even though it
has Unicode). The advantage of Unicode is that you can have English,
German, Russian, Greek, Japanese, and Arabic all in one file, with
one encoding. That said, I suspect most people don't see the utility
of a program being multilingual, since most just want their own local
language supported.

Unicode *could* make internationalization easier, but only if everyone
switches to that encoding. The problem is that not everyone is
willing or able to switch; thus, even if you support Unicode, you
still need to support all the encodings your customers actually use.
(Unicode might have been a great solution if it had been introduced
in the '50s or '60s.)

Note that even though ext2 doesn't really care what format a file name
is in, at the external level you can't tell what encoding was actually
used. Thus, if you've got some filenames that are Latin-1, some that
are EUC, and some that are BIG-5, things are going to get confused,
because you have no way of telling which encoding goes with which
file. The solution is either to settle on a standard encoding
explicitly or to mark each file with the encoding it used. Right now
the situation is that you can't change locales in the middle of an
'ls'.
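
To make that concrete, here's a minimal sketch (assuming an
XPG/POSIX-style iconv() is available; the convert_name() function is
just made up) of converting a raw directory-entry name to UTF-8 for
display.  The "ISO-8859-1" is only a guess, which is exactly the
problem: nothing on disk tells you whether the guess is right.

    #include <string.h>
    #include <iconv.h>

    /* Convert a raw ext2 filename to UTF-8 for display, *assuming* it
     * was written as Latin-1.  If it was really EUC or BIG-5, the
     * conversion either fails or silently produces the wrong text --
     * the filesystem itself gives us no way to know. */
    int convert_name(const char *raw, char *out, size_t outlen)
    {
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  /* our guess */
        char *in = (char *)raw;
        char *outp = out;
        size_t inleft = strlen(raw), outleft = outlen - 1;

        if (cd == (iconv_t)-1)
            return -1;
        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
            iconv_close(cd);
            return -1;
        }
        *outp = '\0';
        iconv_close(cd);
        return 0;
    }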

I also feel that Unicode's biggest value is as an *internal* data
format, not necessarily something that gets seen externally. I.e.,
you convert all multibyte data on input, and then process it
internally as wide characters, something like the sketch below.
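
Roughly this, using the standard C mbstowcs() and wcstombs() calls
and whatever encoding the current locale says the input is in (just a
rough sketch, not taken from any real program):

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>

    /* Convert at the edges: multibyte text becomes wide characters on
     * input, all internal processing works on wchar_t, and it is only
     * turned back into multibyte form on output. */
    int main(void)
    {
        char mb[256];
        wchar_t wide[256];
        size_t n;

        setlocale(LC_CTYPE, "");            /* use the user's locale */

        if (fgets(mb, sizeof mb, stdin) == NULL)
            return 1;

        n = mbstowcs(wide, mb, 256);        /* multibyte -> wide */
        if (n == (size_t)-1)
            return 1;                       /* bad sequence for this locale */

        /* ... process 'wide' here, one wchar_t per character ... */

        wcstombs(mb, wide, sizeof mb);      /* wide -> multibyte */
        fputs(mb, stdout);
        return 0;
    }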