Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 26 Aug 1997 13:11:35 -0700 (PDT)


On Tue, 26 Aug 1997, Michael Poole wrote:

> As a preface: Alex, your rants aren't going to convert very many
> people to your side; while not very many people (for or against Unicode
> support .. where? in the kernel? is that what this is about?) are
> arguing in depth with facts, the volume of your vitriol exceeds others'.
>

People who support Unicode here present no facts except that it works for
them. Since Unicode is based on their native charset, I would be surprised
if it were otherwise. They often claim that Unicode is "accepted" (it is
not), "universal" (it is not), or better for our own good, etc. I don't
think many people here will bother to check whether those claims are true
unless someone who deals with a language not covered by iso8859-1 in
everyday life explains how much trouble Unicode will cause for the very
people who were supposed to benefit from it.

> The main point I'd like to make, though, is this: this is the
> linux-kernel mailing list; we should try to restrict ourselves to
> discussions pertinent to the kernel.

If any charset is declared the only one possible for the filesystem, and
the kernel "assumes" so in its operations, it will cause countless
problems for people who try to keep using their own encoding in
userspace. This is why it is very relevant.

> A general debate about the
> advantages of Unicode or some other character encoding doesn't need to
> take place; the only reason to discuss character set encodings on
> linux-kernel are to decide what the kernel should use.

It should not "use" charsets. It should let the user (or the userspace
programmer) use whatever charset he chooses for his task. The kernel is the
wrong place to enforce charset restrictions. All filesystems except NTFS
and FAT are already used with every charset except Unicode: NTFS is an
exception because it has Unicode hardcoded, and Unicode is an exception
because all other charsets were designed to be usable with filesystems
directly, while only Unicode needs the UTF-8 representation for that
purpose.

> For simplicity's
> sake, the kernel encoding should have certain features (which I discuss
> below), which are generally not provided by native encodings.

[skipped]

> >
> > > > It simplifies issues for GUI-writers and creates a nightmare for everyone
> > > > else. Of course, Microsoft doesn't care about anything but GUI, but I do.
> > >
> > > Of course, _nobody_ has presented the slightest shred of evidence that it
> > > creates nightmares for anybody,
> >
> > ...If that "anybody" speaks English and German.
>
> You posted earlier a question asking for someone's authority to
> make an assertion which is almost the opposite of what I quote above.

False. In case you misunderstood the line above: Unicode is just fine for
English, German, French and Spanish. So is iso8859-1. It is trouble for
everyone else.

> However, you don't bother to explain why you have authority to complain
> about lost quality.

Because I speak Russian. Russian lies outside the "comfortable"
iso8859-1-compatible range of Unicode; German and French are inside it.
Therefore a German speaker explaining how wonderful Unicode is for him
can't claim that others must like Unicode as much as he does. OTOH, my
language, while not in the "comfortable" range of Unicode, is still
single-byte and glyph-based, so I can't (and don't) speak for people whose
languages and writing systems demand real multibyte encodings.

> For the problems with supporting wide-character encodings in
> userspace, I'll refer any interested parties to the debate/flame war which
> Alex Belits and I and several other people participated in earlier this
> year on comp.os.linux.development.system, but this is the *linux-kernel*
> list, and user-space issues generally aren't relevant here.

It will be fine if those issues are handled in userspace. But the kernel
should be charset-neutral except when handling devices and filesystems
with a hardcoded, unchangeable charset. Mandatory userspace translation of
charsets for a single-charset kernel is unacceptable in situations where
the kernel can simply be transparent.
>
> In the kernel, I think the decision that needs to be made rests on
> these points:
> * How efficient is it in terms of encoding text?
> - For the present, this 'text' is going to be almost entirely ASCII,
> since AFAIK the kernel doesn't involve itself with text inside files.
> - In the future, this may change for some users.

All native charsets allow ASCII fallback in their native form, without
UTF-8-like conversion.
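
The ASCII-fallback point can be illustrated with a small sketch. It
assumes Python 3 and its standard koi8_r, latin_1, iso8859_2 and utf_8
codecs; none of these, nor the sample names, come from the original post.

```python
# Hypothetical sample names, chosen only for illustration.
ascii_name = "README.txt"
# Six Cyrillic letters, written as escapes so this source stays ASCII.
russian_name = "\u043f\u0440\u0438\u0432\u0435\u0442"

codecs = ["koi8_r", "latin_1", "iso8859_2", "utf_8"]

# Pure-ASCII text encodes to the same bytes under each of these codecs.
ascii_encodings = {codec: ascii_name.encode(codec) for codec in codecs}
ascii_is_identical = len(set(ascii_encodings.values())) == 1

# Cyrillic text: one byte per letter in KOI8-R, two per letter in UTF-8.
koi8_length = len(russian_name.encode("koi8_r"))
utf8_length = len(russian_name.encode("utf_8"))

print(ascii_is_identical, koi8_length, utf8_length)
```

The sketch only compares byte counts for one short sample; it says
nothing about the cost of conversion code in either direction.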

> * How small is the source and binary code which handles operations on the
> text which the kernel needs to do?

If the kernel doesn't mess with names in other charsets, there will be 0
bytes of code to handle them.

> - For the most part, the kernel just needs to handle iteration over
> characters (in both directions?)

...and since there are no zero bytes in native encodings, it can just
assume that all data is 8-bit -- userspace will treat sequences as
multibyte characters, as it has done for a long time now.
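
The 8-bit-clean argument can be sketched as follows. This is an
illustration in Python 3, not kernel code: split_path is a hypothetical
helper, and the choice of KOI8-R and UTF-8 as sample encodings is an
assumption. It treats a path as opaque bytes in which only '/' (0x2F) and
NUL (0x00) are special.

```python
def split_path(raw_path):
    """Split a path into components without decoding any characters.

    Hypothetical helper: only the bytes 0x2F ('/') and 0x00 (NUL) are
    treated as special; everything else is passed through untouched.
    """
    if b"\x00" in raw_path:
        raise ValueError("embedded NUL byte in path")
    return [part for part in raw_path.split(b"/") if part]


# A Russian word ("document"), written with escapes to keep this source ASCII.
name = "\u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442"

for codec in ("koi8_r", "utf_8"):
    encoded = name.encode(codec)
    components = split_path(b"/home/alex/" + encoded)
    # The component comes back byte-for-byte as userspace wrote it.
    assert components == [b"home", b"alex", encoded]
```

Nothing in split_path knows which charset the name is in; interpreting
the bytes as characters is left entirely to the caller.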

> - Console output is an issue -- currently it only supports 256 or 512
> characters on PCs, but if something like GGI becomes prevalent, most
> users will expect it to support the full range of supported characters.
> This means that to display Han characters you'll need a large bitmap
> table containing them, but this can be arranged to be swapped out.

X encountered that problem long ago and solved it just fine. There is no
need to invent another wheel that is incompatible with the existing one.

> - Input is another issue, but I don't feel qualified to comment on it;
> I don't have any idea how it's currently handled or how
> foreign-language input methods generally (or "should") work.

X does that. If anyone needs it on the console, the same concept of
configurable input methods living entirely in userspace can be used.

> > > and there is some evidence that it
> > > actually eliminates such nightmares (such as supporting two dozen
> > > different, incompatible character sets, maybe with an abomination like ISO
> > > 2022 as "solution", and not covering half as much territory).
> >
> > Clay tablets support more.. Let's switch to clay tablets.
>
> Gee, that'll make it kind of hard to store characters on disk.
> "I've got a 1,000,000-tablet SIMM here, how much do you want for it?"
>
> I've yet to see you argue what encoding(s) should be used, or even
> what features they should have, but you seem to be convinced that "native"
> (status quo) encodings are better than anything new. Here are my
> arguments on why we need something like Unicode or UTF-8 support in the
> kernel, as a list of the features required:
> * Unambiguous encodings of distinct characters within a language

...that still requires the language to be identified unless you are only
at the end-output device. Native charsets with such identifiers are just
as unambiguous.

> * Relatively easy to find the begin and end of characters (not loads of
> state), since it's bad to store fractional characters eg in a filename

Native encodings are designed that way already, even though it's a moot
point for a filename anyway -- it has a _very_ clear beginning.
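
For reference, the boundary property being asked for looks like this in
UTF-8. This is a sketch in Python 3 that assumes valid UTF-8 input;
char_start is a hypothetical helper name, and the sketch makes no claim
about how any particular native multibyte encoding behaves.

```python
def char_start(data, index):
    """Return the offset of the first byte of the character covering index.

    UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so
    stepping backwards past them reaches the lead byte.
    """
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


# "fil" followed by U+0436 (Cyrillic zhe), which is two bytes in UTF-8.
data = "fil\u0436".encode("utf_8")
assert char_start(data, 4) == 3  # offset 4 is a continuation byte
assert char_start(data, 2) == 2  # an ASCII byte is its own start
```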

> * A single encoding should be used for all character sets -- you wouldn't
> want to have to make guesses about the character set something is in,
> and thus possibly misdisplay or mishandle the text.

I would rather misdisplay data than lose it. Converting to Unicode while
dropping the metainformation loses information.

> Native encodings don't provide this -- Latin-1 and Latin-2
> conflict on some characters, and BIG5 and JIS conflict with each other and
> Latin-1 and Latin-2 on more characters.

If metainformation is still needed, Unicode and native encodings are
equally unambiguous. Losing the metainformation will immediately cause
mishandling in complex text-processing operations.
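
Both sides of this exchange can be seen in a short sketch, assuming
Python 3 and its standard latin_1 and iso8859_2 codecs. The (charset,
bytes) pair is a hypothetical labelling scheme used only for
illustration, not a format proposed anywhere in this thread.

```python
raw = b"\xb1"  # one byte with no charset label attached

# The same byte is a different character depending on the assumed charset.
as_latin1 = raw.decode("latin_1")    # U+00B1 PLUS-MINUS SIGN
as_latin2 = raw.decode("iso8859_2")  # U+0105 LATIN SMALL LETTER A WITH OGONEK

# Hypothetical labelled form: the bytes travel with their charset name.
labelled = ("iso8859_2", raw)
charset, payload = labelled
decoded = payload.decode(charset)
```

Without the label the byte is ambiguous; with it, the native encoding
decodes to exactly one character.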

> I've heard that begin-end detection in certain Far-East encodings
> is impossible to do unless you start from a known string beginning as
> well, but I don't know the details of that.

Again, in the kernel you _do_ always have a clear beginning and end (not
that the kernel cares about it). I agree that for userspace it would be
better to devise some generic format that provides cleaner metainformation
handling, but just throwing the metainformation away and keeping Unicode
is worse.

> > > Plus, there's no reason why GUI writers should profit more than anybody
> > > else.
> >
> > You really don't know the difference between buttons-drawing and text
> > processing in databases?
>
> I'll point out two things:
> * that both of these are problems in user-space, not in the kernel

A restriction placed in the kernel will force userspace either to accept
it or to perform a complex and inefficient conversion every time a
filename-related syscall is made.

> * you don't provide any support for your argument

False.

--
Alex