Re: unicode (char as abstract data type)

Matthias Urlichs (smurf@work.noris.de)
21 Apr 1998 11:52:14 +0200


Alex Belits <abelits@phobos.illtel.denver.co.us> writes:
>
> To convert what? We have multiple encodings because we have multiple
> languages, and conversion through Unicode is useful only within the

So you want to have a new encoding for every language.

That's your privilege, but other people (like me) want to have _one_
encoding for _every_ language.

> language because otherwise there will be nothing to map into. koi8-r and
> iso8859-1 charsets have no common characters except the 7-bit ASCII range.
>
So don't map and display as '?' or transliterations, then. If you have an
8-bit terminal, you have no other option anyway, so that's a null argument.
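
A minimal sketch of the "don't map, display as '?'" fallback, written in
Python purely as an illustration. The helper name to_terminal is
hypothetical; the codec names and the "replace" error handler are standard
Python. It assumes the text is decoded to Unicode first and then re-encoded
for whatever the terminal can show.

```python
def to_terminal(data: bytes, source: str, terminal: str) -> bytes:
    """Re-encode data for a terminal, substituting '?' for unmappable characters."""
    # Decode from the source encoding into Unicode, then encode for the
    # terminal; errors="replace" emits '?' where the target has no mapping.
    return data.decode(source).encode(terminal, errors="replace")


koi8_text = "Привет, world".encode("koi8_r")
print(to_terminal(koi8_text, "koi8_r", "latin-1"))  # b'??????, world'
```

The ASCII part survives and each Cyrillic letter becomes one '?', which is
all an 8-bit Latin-1 terminal could do with it in any case.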

> > If libc can use UCS2 to call the kernel, then the kernel
> > only needs to perform half of the conversion and libc won't
> > need to convert back to UCS2. Put more of it in user-space!
>
> That will work only if absolutely everything in userspace uses Unicode
> or always has charset information available for every string at the time
> it is passed to kernel. None of these two situations exist in reality.
>
Currently you have a default, which is presumably KOI8-R, and you cannot
enter or display anything which isn't. Which is fine for your personal
KOI8-R island, but since when did that word describe the whole world?

That's what people are trying to do with UTF-8: have one system which can
be made to display all the world's file names, for instance. You can't do
that if you need a per-filesystem (per-file, ultimately) attribute of "which
encoding is _that_ supposed to be"; this gets Very Very Ugly. Look at Linux
2.1.xx or 2.0.34 to see exactly how ugly.

> I have never seen users voluntarily using different encodings of the
> same language on the same OS -- originally multiple encodings for the same
> languages were created because of incompatible operating systems and
> hardware. The real problem is, what will happen if user uses

Try the Japanese.

> language. And please, don't tell me that every program will be able to
> label charset before it writes -- I will like to see, what will convert
> encodings in
>
> ls -l >> "`ls | head -1`"
>
That's it EXACTLY, you're just using the wrong word here. You want a way
NOT to label ENCODINGS. The idea behind Unicode is to get rid of the idea
"we have to label encodings". The orthogonal idea "we have to label
charsets" is untouched by any of this, you'll need a way to distinguish
Helvetica from Times (or the KOI8-R or Hiragana or Hebrew or ...
equivalent) no matter which encoding you end up using. Ditto for language
(do I need to run this text through the English or the German spellchecker?).

> Users don't bring encodings with themselves. In Russia even at the time
> when every desktop PC with DOS was incapable of displaying anything but
> cp866 encoding because of pseudographics in IBM charset, all email between

In other words, you had two 1:1 encodings which happily mapped onto each
other reasonably transparently.

If anybody in between had tried to interpret these characters, instant
chaos.
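
A small Python sketch of both halves of that, as an illustration only. The
helper name transcode is hypothetical, and it assumes the message contains
only Cyrillic letters, which exist in both cp866 and KOI8-R.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one 8-bit encoding to another via Unicode."""
    return data.decode(src).encode(dst)


message = "Привет"
as_cp866 = message.encode("cp866")

# Gateway converts cp866 -> KOI8-R on the way out and back on the way in.
as_koi8 = transcode(as_cp866, "cp866", "koi8_r")
round_trip = transcode(as_koi8, "koi8_r", "cp866")
print(round_trip == as_cp866)  # True

# Somebody in between reads the KOI8-R bytes as if they were cp866.
misread = as_koi8.decode("cp866")
print(misread == message)  # False
```

The conversion is lossless only as long as everybody agrees on which
encoding the bytes are in at each hop.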

Guess what? Unicode says you get your 1:1 encoding even if an Arab
terrorist filters out all the Hebrew characters/encodings/whatever on the
way. Or if, more to the point, somebody filters out graphics characters
(they must be some secret code, after all, and on Fidonet you're not
allowed to use secret codes, so there!).

Note I'm being _slightly_ sarcastic here.

> If the application will know, what charset it is using. Right now
> parts of my xemacs are still under impression that I use iso8859-1
> charset, however the default fonts are in koi8-r, and things work just
> fine. One can say that xemacs could be designed better, however there are

You're again mixing up charsets and encodings.

> > Option 1: compile that knowledge into libc
> > Option 2: use an environment variable that libc interprets
>
> Yes, then smart MIME parser will know one thing and even smarter libc
> will know something completely different.
>
Yes, but if you use the same environment variable which these two actually
have to adhere to, because frankly if they don't nobody will use them at
all, you get KOI8-R output from your MIME parser and the libc translates
those parsed file names into UTF-8, which is exactly what you want. So
what's the problem?
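
Roughly like this, as a Python sketch rather than anything libc actually
does. The variable name FILENAME_CHARSET and the helper filename_to_utf8
are made up for the illustration; the assumption is that every program
reads the same variable and that UTF-8 is the default when it is unset.

```python
import os


def filename_to_utf8(raw: bytes, environ=os.environ) -> bytes:
    """Translate a file name from the user's declared encoding into UTF-8."""
    # FILENAME_CHARSET is a hypothetical variable name, not an existing one.
    encoding = environ.get("FILENAME_CHARSET", "utf-8")
    return raw.decode(encoding).encode("utf-8")


# The MIME parser hands over a KOI8-R name; the library layer converts it.
koi8_name = "файл.txt".encode("koi8_r")
utf8_name = filename_to_utf8(koi8_name, {"FILENAME_CHARSET": "koi8_r"})
print(utf8_name.decode("utf-8"))  # файл.txt
```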

> > You don't use charset labeling on your filenames, do you?
>
> Because I don't use non-English filenames now. However I _do_ use
> non-English headers in email, and they are separately charset-labeled, as
> well as message body or message body parts.
>
No they're not, they're encoding-labeled. See above.

> Unicode is supposed to be used by people who don't and can't use Latin1.
> Myself included.

Nope, Unicode is supposed to be used by _everybody_, including these damn
ASCII freaks in the US and these ISO-whatever-1 junkies in Western Europe.

You can bet that sometime soon I, even though I don't really directly
benefit by any of this, _will_ make an effort to UTF-8ize everything on
this system I can get my grubby hands on. Not just to be different, but to
make the tools available for these people who _really_ need them.

> No, it isn't. Kernel just uses wide characters, and no one in userspace
> seriously relies on that. If one will try to use Unicode, countless things

The kernel doesn't use wide characters. Currently it uses whatever you
throw at it, except that any encodings which use null bytes or slashes for
anything other than their intended meaning are Forbidden By Law.
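
To see which encodings fall foul of that rule, here is a minimal Python
sketch, offered as an illustration only. The helper name stray_bytes is
hypothetical, and Python's utf-16-be codec stands in for big-endian UCS-2,
which is byte-identical to it for characters in the Basic Multilingual
Plane.

```python
def stray_bytes(name: str, encoding: str) -> list:
    """Return the NUL (0x00) and slash (0x2F) bytes in the encoded name.

    Assumes name itself contains no NUL or '/' characters, so any byte
    reported here is a side effect of the encoding.
    """
    return [b for b in name.encode(encoding) if b in (0x00, 0x2F)]


print(stray_bytes("Я", "utf-8"))      # []
print(stray_bytes("Я", "utf-16-be"))  # [47]
print(stray_bytes("A", "utf-16-be"))  # [0]
```

UTF-8 keeps every byte of a multi-byte sequence at 0x80 or above, so it
never produces a stray NUL or slash; a 16-bit encoding does so routinely.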

> Look, _how_ they use it. Try to find anything in Solaris Unicode-based
> except text-converting utilities.
>
So? As soon as the market is there (and "how come Linux can have a
multicharacterset aware Emacs which Just Plain Cannot Work under Solaris
unless I port glibc to Solaris, at which point I can just as well install
Linux-Sparc64" might be a nice little piece of that market pressure) things
will happen.

-- 
Matthias Urlichs
noris network GmbH
