Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberi=

Matthias Urlichs (smurf@lap.noris.de)
27 Aug 1997 11:48:54 +0200


Alex Belits <abelits@phobos.illtel.denver.co.us> writes:
>
> > because you can't put more than one language into one document -- try
> > mixing the German idiocy of usurping ASCII []{}\| for umlaut characters
> > with C source code -- oh, so you want to use trigraphs ???).
>
> This is why they have switched to iso8859-1.
>
iso8859-1 is a subset of Unicode (it just has a different encoding). If we
can't get by with a single-byte encoding any more, fine, switch to
multibyte encoding with UTF-8 or whatever. Doesn't really make a difference
to me. Existing software can deal better with single-byte characters,
true, but hopefully that'll change (slowly).
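
To make the "subset, different encoding" point concrete, here is a small Python sketch. The sample string is my own choice for illustration; the codec names are the ones Python's standard library ships. It only shows that every iso8859-1 byte value equals the Unicode code point of the same character, while UTF-8 spends two bytes on the non-ASCII ones.

```python
# Sample text chosen for illustration: it contains two non-ASCII Latin-1 letters.
text = "Grüße"

latin1 = text.encode("iso8859-1")   # one byte per character
utf8 = text.encode("utf-8")         # one byte for ASCII, two for ü and ß

# In iso8859-1 each byte value is exactly the character's Unicode code point.
codepoints_match = all(b == ord(ch) for b, ch in zip(latin1, text))

# ASCII-only text is byte-for-byte identical in both encodings.
ascii_identical = "main.c".encode("iso8859-1") == "main.c".encode("utf-8")
```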

> > There's also the question of what you want to achieve. Shall capital letters
> > be wholly distinct? Ignored? Be used for some sort of secondary ordering?
>
> I don't want anything but the application to make this decision -- I had
> enough trouble with "smart" case-matching in DOS/Windows. Application
> should handle that and it should have _means_ to handle that easily.
>
Right, but that's not an issue of Unicode vs. anything else..?

> "generic mechanism" != "single charset". Generic mechanism can include
> charsets and have nice way of handling them (X11 has one, suitable for its
> user interface needs -- nothing prevents to make generic mechanism that
> uses charsets definitions that will contain proper procedures of handling
> more complex aspects of charsets and languages). Of course, that should be
> done entirely in userspace, and kernel shouldn't interfere.
>
You cannot do the distinction of "is this file name Cyrillic or Hebrew or
what" in userspace. Somehow the kernel must have a clue what it is so that
userspace can do the right thing. The easiest way to get that clue from the
kernel is to encode the filenames with UTF-8. Whether userspace then uses
Unicode or Big5 or iso8859-461678 is a different matter entirely.
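
As a rough illustration of what userspace gains once the names are UTF-8, the sketch below guesses which scripts a file name uses. The function name scripts_in and the trick of taking the first word of each character's Unicode name are my own simplifications, not an established API; a real program would want a proper script property lookup.

```python
import unicodedata

def scripts_in(raw_name: bytes) -> set:
    """Guess the scripts used in a UTF-8 encoded file name.

    Hypothetical helper: it takes the first word of each letter's Unicode
    character name ("CYRILLIC", "HEBREW", "GREEK", "LATIN", ...) as a crude
    stand-in for the script.
    """
    text = raw_name.decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()  # skip dots, digits, punctuation
    }

cyrillic_name = "письмо.txt".encode("utf-8")
hebrew_name = "שלום".encode("utf-8")
```

With a raw single-byte name and no charset tag, the same bytes could be Cyrillic, Greek or Hebrew, and no amount of userspace cleverness can tell which.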

> > There are also alternatives which won't work. Inserting the disk from
> > my Greek friend into the Russian friend's disk drive and having the
> > filenames show up in some jumble of nonunderstandable Cyrillic letters is
> > Not An Option
>
> Why? Will he be able to read them any better?

You're assuming that the Russian doesn't know any Greek and the correct way
of displaying the names (i.e. with Greek letters) would never be of any
use to anybody who's not in Greece and/or didn't configure his computer for
"Greek filenames, please".

I'd like to have kernel support which does _not_ presuppose this.

> > (it gets worse with multibyte characters -- "sorry, but this
> > character doesn't exist in Klingonese, so you can't type it, thus you can't
> > open this file"
>
> What????? If a program has the same bytes in argument to open() as ones in
> filename, file will be opened, otherwise not.

Yeah, but how do you display a nonexistent (in your local character set)
multibyte sequence, and how do you type it? At least with Unicode, the
userspace program can see that the character set the file name uses isn't
supported right now and can offer an alternate solution, like UTF-7 or
whatever else gets the job done.
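
A minimal sketch of that fallback, assuming the name arrives as UTF-8 and the user's terminal speaks some local single-byte charset. The function name display_name is hypothetical, and UTF-7 is used only because I mentioned it above; any reversible ASCII escape would do.

```python
def display_name(raw_name: bytes, local_charset: str) -> str:
    """Return a string that the local charset can actually show.

    Hypothetical helper: if every character of the UTF-8 name exists in
    local_charset, show it as is; otherwise fall back to a reversible
    ASCII-only form (UTF-7 here).
    """
    text = raw_name.decode("utf-8")
    try:
        text.encode(local_charset)  # probe only; the result is discarded
        return text
    except UnicodeEncodeError:
        return text.encode("utf-7").decode("ascii")

greek = "αβγ".encode("utf-8")
shown_in_greece = display_name(greek, "iso8859-7")  # charset has Greek letters
shown_in_russia = display_name(greek, "koi8-r")     # charset has no Greek letters
```

The fallback is ugly, but it can be typed on any keyboard and converted back to the real name, which is more than a jumble of wrong letters offers.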

I'd say that a jumble of Roman characters is better than a smaller jumble
of random Chinese characters with the occasional black square (for
nonexistent characters) thrown in, assuming a Chinese user gets a file with a
Hebrew name for instance.

> > Marking the disk as "On this disk, all names are Cyrillic" and another disk
> > as "Greek" and another as "Big5" and another as ... doesn't make sense
> > either. What shall a multilingual translator do, one hard disk per
> > language??
>
> Multilingual translator needs a more powerful thing than Unicode anyway, but

I _know_ that. So what? The problem doesn't go away just because some
people think Unicode does a bad job with it. All I maintain is that the
current jumble of local character sets, in the context of file names(!),
does an even worse job. Everything else Unicode may or may not do right is
NOT a kernel issue and doesn't belong in this list.

> > You can pick nits with Unicode all you like, but please, if you want to
> > replace it, offer us some alternative which actually can be made to work
> > for everybody and which isn't just another 80% (or even 99%) non-solution.
>
> Any solution should be as far from filesystem name handling as possible.
>
But it needs to _support_ file system name handling.