Re: unicode (char as abstract data type)

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 21 Apr 1998 09:49:51 -0700 (PDT)


On 21 Apr 1998, Matthias Urlichs wrote:

> Alex Belits <abelits@phobos.illtel.denver.co.us> writes:
> >
> > To convert what? We have multiple encodings because we have multiple
> > languages, and conversion through Unicode is useful only within the
>
> So you want to have a new encoding for every language.

Old encodings. Very old encodings, already used everywhere.

> That's your privilege, but other people (like me) want to have _one_
> encoding for _every_ language.

Some people also wanted one language and one government. Others didn't
like that, either.

> > language because otherwise there will be nothing to map into. koi8-r and
> > iso8859-1 charsets have no common characters except the 7-bit ASCII range.
> >
> So don't map and display as '?' or transliterations, then. If you have an
> 8-bit terminal, you have no other option anyway, so that's a null argument.

No, I can display it as whatever bytes it maps into, and still keep
distinct strings distinct. Or in backslash-octal notation, as emacs
did. I am already sick and tired of stupid userspace programs that
convert[ed] everything non-ASCII to '?', and have absolutely no need
for that in the kernel.
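
To make the difference concrete, here is a minimal Python sketch of the
two display strategies. The function names are hypothetical and are not
taken from emacs or any other program; the byte strings are assumed to
be KOI8-R text.

```python
def to_question_marks(raw: bytes) -> str:
    """Lossy display: every non-printable-ASCII byte becomes '?'."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in raw)


def to_backslash_octal(raw: bytes) -> str:
    """Lossless display: non-ASCII bytes become \\NNN octal escapes."""
    out = []
    for b in raw:
        if b == 0x5C:
            out.append("\\\\")          # escape the backslash itself
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))          # printable ASCII passes through
        else:
            out.append("\\%03o" % b)    # everything else as 3 octal digits
    return "".join(out)


name_a = b"\xcd\xc9\xd2"  # "mir" in Cyrillic, as KOI8-R bytes
name_b = b"\xd2\xc9\xcd"  # "rim" in Cyrillic, as KOI8-R bytes

print(to_question_marks(name_a), to_question_marks(name_b))
print(to_backslash_octal(name_a), to_backslash_octal(name_b))
```

With '?' substitution both names look identical; with octal escapes
they stay distinguishable even on a terminal that can show only ASCII.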

> > > If libc can use UCS2 to call the kernel, then the kernel
> > > only needs to perform half of the conversion and libc won't
> > > need to convert back to UCS2. Put more of it in user-space!
> >
> > That will work only if absolutely everything in userspace uses Unicode
> > or always has charset information available for every string at the time
> > it is passed to the kernel. Neither of these two situations exists in
> > reality.
> >
> Currently you have a default, which is presumably KOI8-R, and you cannot
> enter or display anything which isn't. Which is fine for your personal
> KOI8-R island, but since when did that word describe the whole world?

I can live with it if that remains supported everywhere and everything
new and multilingual adds charset labels -- old software will still
work, new multilingual software will work, too, and the worst one can
expect is a charset-switching sequence interpreted as text by some
program -- but then it will be passed through transparently and won't
be harmed.

> That's what people are trying to do with UTF-8, have one system which can
> be made to display all the world's file names. For instance. You can't do
> that if you need a per-filesystem (per-file, ultimately) attribute of "which
> encoding is _that_ supposed to be"; this gets Very Very Ugly. Look at Linux
> 2.1.xx or 2.0.34 to see exactly how ugly.

Again, it has nothing to do with the kernel, and the kernel has no
business meddling there in the first place.

> > I have never seen users voluntarily using different encodings of the
> > same language on the same OS -- originally multiple encodings for the same
> > languages were created because of incompatible operating systems and
> > hardware. The real problem is, what will happen if user uses
>
> Try the Japanese.

AFAIK, they convert the multiple encodings into one that they can
handle, and use that.

> > language. And please, don't tell me that every program will be able to
> > label the charset before it writes -- I would like to see what will
> > convert encodings in
> >
> > ls -l >> "`ls | head -1`"
> >
> That's it EXACTLY, you're just using the wrong word here. You want a way
> NOT to label ENCODINGS. The idea behind Unicode is to get rid of the idea
> "we have to label encodings". The orthogonal idea "we have to label
> charsets" is untouched by any of this, you'll need a way to distinguish
> Helvetica from Times (or the KOI8-R or Hiragana or Hebrew or ...
> equivalent) no matter which encoding you end up using. Ditto for language
> (do I need to run this text through the English or the German spellchecker?).

Helvetica is a font. Russian is a language. koi8 is a charset. Fonts are
irrelevant, but languages must be preserved for any sane
internationalization. And since labeling languages automatically
provides a way to label charsets, there is no need for messy Unicode.

> > Users don't bring encodings with them. In Russia, even at the time
> > when every desktop PC with DOS was incapable of displaying anything but
> > the cp866 encoding because of the pseudographics in the IBM charset, all
> > email between
>
> In other words, you had two 1:1 encodings which happily mapped onto each
> other reasonably transparently.
>
> If anybody in between had tried to interpret these characters, instant
> chaos.
>
> Guess what? Unicode says you get your 1:1 encoding even if an Arab
> terrorist filters out all the Hebrew characters/encodings/whatever on the
> way. Or if, more to the point, somebody filters out graphics characters
> (they must be some secret code, after all, and on Fidonet you're not
> allowed to use secret codes, so there!).
>
> Note I'm being _slightly_ sarcastic here.

No, you are just saying completely irrelevant things.

> > If the application knows what charset it is using. Right now
> > parts of my xemacs are still under the impression that I use the
> > iso8859-1 charset; however, the default fonts are in koi8-r, and things
> > work just fine. One can say that xemacs could be designed better,
> > however there are
>
> You're again mixing up charsets and encodings.
>
> > > Option 1: compile that knowledge into libc
> > > Option 2: use an environment variable that libc interprets
> >
> > Yes, and then the smart MIME parser will know one thing and the even
> > smarter libc will know something completely different.
> >
> Yes, but if you use the same environment variable which these two actually
> have to adhere to, because frankly if they don't nobody will use them at
> all, you get KOI8-R output from your MIME parser and the libc translates
> those parsed file names into UTF-8, which is exactly what you want. So
> what's the problem?

An environment variable is global to a process. MIME provides a charset
for every message part and header line. Conflicts can't be resolved
without knowledge of MIME or per-string charset/language handling, or...
transparency at all levels except in the MIME application.
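
As an illustration of that conflict, here is a small Python sketch using
the standard `email.header.decode_header` function. The helper names and
the sample header are made up for the example; the one-default decoder
stands in for a hypothetical libc driven by a single environment
variable.

```python
import base64
from email.header import decode_header


def encoded_word(text, charset):
    """Build an RFC 2047 base64 encoded-word labeled with its charset."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?%s?B?%s?=" % (charset, payload)


def decode_with_labels(header_value):
    """Decode each part using the charset named in its own label."""
    parts = []
    for raw, charset in decode_header(header_value):
        if isinstance(raw, bytes):
            parts.append(raw.decode(charset or "ascii"))
        else:
            parts.append(raw)  # header without encoded-words comes back as str
    return parts


def decode_with_one_default(header_value, default_charset):
    """Ignore the labels and apply a single process-wide charset instead."""
    parts = []
    for raw, _label in decode_header(header_value):
        if isinstance(raw, bytes):
            parts.append(raw.decode(default_charset, errors="replace"))
        else:
            parts.append(raw)
    return parts


# One header line carrying two differently labeled pieces of text.
header = encoded_word("привет", "koi8-r") + " " + encoded_word("café", "iso-8859-1")

print(decode_with_labels(header))
print(decode_with_one_default(header, "koi8-r"))
```

The labeled decoder recovers both words; the single-default decoder
gets only the part that happens to match the default right.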

> > > You don't use charset labeling on your filenames, do you?
> >
> > Because I don't use non-English filenames now. However, I _do_ use
> > non-English headers in email, and they are separately charset-labeled,
> > as are the message body and message body parts.
> >
> No they're not, they're encoding-labeled. See above.

I would rather label them with =? than use a completely different
encoding for names and content.

> > Unicode is supposed to be used by people who don't and can't use Latin1.
> > Myself included.
>
> Nope, Unicode is supposed to be used by _everybody_, including these damn
> ASCII freaks in the US and these ISO-whatever-1 junkies in Western Europe.

No, it was _developed_ by them, and now they are pushing it down others'
throats.

> You can bet that sometime soon I, even though I don't really directly
> benefit by any of this, _will_ make an effort to UTF-8ize everything on
> this system I can get my grubby hands on. Not just to be different, but to
> make the tools available for these people who _really_ need them.

Wow, I'm scared.

> > No, it isn't. The kernel just uses wide characters, and no one in
> > userspace seriously relies on that. If one tries to use Unicode,
> > countless things
>
> The kernel doesn't use wide characters. Currently it uses whatever you
> throw at it, except that any encodings which use null bytes or slashes for
> anything other than their intended meaning are Forbidden By Law.

There aren't any nulls in those encodings anyway.
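
A quick Python sketch of that constraint, checking which encodings put
a NUL or '/' byte into a non-ASCII file name. The helper name and the
sample names are hypothetical, and Python's "utf-16-be" codec is used
here as a stand-in for UCS2 (the two agree for characters in the Basic
Multilingual Plane).

```python
FORBIDDEN = (0x00, 0x2F)  # NUL and '/', the two bytes a path component cannot contain


def unsafe_bytes(name, encoding):
    """Return the set of forbidden byte values produced by encoding `name`."""
    return {b for b in name.encode(encoding) if b in FORBIDDEN}


cyrillic_name = "файл.txt"      # Cyrillic stem with an ASCII extension
latin_ext_name = "\u012f"       # LATIN SMALL LETTER I WITH OGONEK, U+012F

for enc in ("koi8-r", "utf-8", "utf-16-be"):
    print(enc, sorted(unsafe_bytes(cyrillic_name, enc)))

print("utf-8", sorted(unsafe_bytes(latin_ext_name, "utf-8")))
print("utf-16-be", sorted(unsafe_bytes(latin_ext_name, "utf-16-be")))
```

The 8-bit charset and UTF-8 never produce either byte for non-ASCII
characters, while a raw 16-bit encoding produces NUL bytes for every
ASCII character and a stray '/' byte for characters such as U+012F.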

> > Look at _how_ they use it. Try to find anything Unicode-based in
> > Solaris except text-converting utilities.
> >
> So? As soon as the market is there (and "how come Linux can have a
> multicharacterset aware Emacs which Just Plain Cannot Work under Solaris
> unless I port glibc to Solaris, at which point I can just as well install
> Linux-Sparc64" might be a nice little piece of that market pressure) things
> will happen.

Bullshit. Solaris does not use Unicode for anything but same-language
charset conversion utilities -- exactly what I propose. Creating
artificial incompatibility for marketing is the last thing I want to
see among Linux developers.

--
Alex
