Re: unicode (char as abstract data type)

Alex Belits (abelits@phobos.illtel.denver.co.us)
Mon, 20 Apr 1998 10:43:02 -0700 (PDT)


On Mon, 20 Apr 1998, Theodore Y. Ts'o wrote:

> Those who prefer charset labeling (such as Alex Belits)

s/Alex Belits/Russian programmers/

> are worried
> about encoding efficiency (especially Alex who doesn't like the fact
> that Russian didn't get some of the 1 byte UTF-8 assignments.) This is
> certainly a consideration. However, charset labeling can get
> *extremely* complex and messy, especially if you want to store
> characters from multiple character sets in the same document or
> filesystem.
>
> For example, suppose you have a dialup system with customers logging
> into your system from all over the world. Suppose further that the
> Russians want to use filenames with Cyrillic characters, the Chinese
> want to use the Han characters, the Europeans want to use ISO Latin 1,
> etc. Clearly, it's not sufficient to put a charset label in the
> superblock.

No one ever proposed that, so it can't be discussed.

> You need to put a character set label in every file, or
> perhaps even put some kind of escape sequence processing if you want to
> be able to support both, say, Kanji and ISO Latin 1 in the same file.

In those extremely rare situations where a filename contains both non-ASCII
ISO 8859-1 characters and Kanji, adding escape sequences (which are used
for Kanji anyway) will cause the least amount of trouble.

> Worse yet, if you want to display such characters, you now need to tell
> the console how to interpret the application-specific escape sequences.
> None of the charset labelling folks have defined a universal escape
> sequence for changing between charsets; fundamentally, they assume that
> all processing on a particular machine will be done in a single
> character set.

I have mentioned the MIME escape sequence ("=?"), which has better reasons
to be used than Unicode (backward compatibility, at the very least).
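
For reference, a filename escaped this way could look roughly like the
sketch below (in the spirit of the MIME "Q" encoding; the charset tag and
the KOI8-R bytes are just an example picked for illustration, not anything
defined anywhere):

  /* Sketch: escape a filename in a local charset as an ASCII-only
   * "=?charset?Q?...?=" word, so the bytes stored on disk stay ASCII. */
  #include <stdio.h>

  static void encode_word(const char *charset, const unsigned char *s)
  {
      printf("=?%s?Q?", charset);
      for (; *s; s++) {
          if (*s == ' ')
              putchar('_');                 /* Q encoding maps space to '_' */
          else if (*s < 33 || *s > 126 ||
                   *s == '=' || *s == '?' || *s == '_')
              printf("=%02X", *s);          /* escape specials and 8-bit bytes */
          else
              putchar(*s);
      }
      printf("?=\n");
  }

  int main(void)
  {
      /* "файл" ("file") in KOI8-R -- purely an example */
      const unsigned char name[] = { 0xC6, 0xC1, 0xCA, 0xCC, 0 };

      encode_word("KOI8-R", name);          /* prints =?KOI8-R?Q?=C6=C1=CA=CC?= */
      return 0;
  }

The escaped name is plain ASCII, so tools that know nothing about the
escaping still handle it as an ordinary string.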

> This gets problematical as soon as you observe that many
> documents need to support characters in multiple character sets, and it
> completely breaks down in client/server applications where people may be
> communicating over the network in multiple languages.

The set of existing non-plain-text document standards is part of the status
quo that can't be changed by simply banning the "wrong" formats and
replacing them with a single one, so again, it isn't something that we can
discuss. We are discussing the possible encoding for the content of
text-only files and for filenames.

In any case, if a file is so language-specific that it has a name in that
language, it makes little sense for it to be viewed by a person who has no
knowledge of that language, and one who has such knowledge will use the
charset and language defined per-user or per-application. In cases where
this is not enough, charset labeling of filenames can do exactly what
charset labeling of MIME headers does, and if one worries about such
situations becoming common, it can be declared mandatory.

> There is also the backwards compatibility issue; how do you handle
> existing ext2 filesystems that are currently using ASCII, and a lot of
> existing code which assumes that the '/' and '\0' characters have
> special meaning.

Local charsets don't use '\0', and I am not sure whether any of them uses
'/', but if one does, it won't be a great problem to declare encoding of
that character mandatory in filenames.
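
As a small illustration (the KOI8-R bytes below are made up for the
example): every national letter in an 8-bit local charset is a byte with
the high bit set, so it cannot collide with '\0' (0x00) or '/' (0x2F):

  #include <assert.h>

  int main(void)
  {
      /* "файл" in KOI8-R -- only an example */
      const unsigned char name[] = { 0xC6, 0xC1, 0xCA, 0xCC, 0 };
      const unsigned char *p;

      for (p = name; *p; p++)
          assert(*p != '/' && *p >= 0x80);  /* no clash with path syntax */
      return 0;
  }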

> For that reason, the only thing which makes sense for
> the ext2 filesystem is to declare that filenames and volume labels are
> in UTF-8.

...and break everything but ASCII, and force all file content that can
contain filenames (say, a Makefile) to be in UTF-8, too. I'd rather use
"=?".

--
Alex
