Re: unicode (char as abstract data type)

Alex Belits (abelits@phobos.illtel.denver.co.us)
Sat, 18 Apr 1998 16:51:43 -0700 (PDT)


On 18 Apr 1998, Kai Henningsen wrote:

> > On Sat, 18 Apr 1998, Martin Mares wrote:
>
> > > No, it just solves the storage and visual representation part of the
> > > problem and leaves the rest to the others.
> >
> > ...while after all necessary meta-information about language context is
> > gone, and there is no way to recover it except by guessing (one charset
>
> "Is gone"? It never was there to begin with. *No* character set that I
> have ever heard of has language labelling, so don't blame Unicode for not
> having it either.

If I have to label the charset, I will label it with the language, too
(see MIME), and that actually buys something. If Unicode is used, there is
"no need" for labeling, yet all the trouble of handling multibyte is
already there, so implementations still won't do labeling and will still
have to handle multibyte. A relatively minor problem is solved at the
expense of a huge waste of resources (Unicode font handling) and of
worsening another, real problem (language-dependent processing no longer
has any chance of getting usable labeling). Labeling, on the other hand,
solves font handling easily and also allows for language-dependent
processing; in other words, it does "The Right Thing".

As for the availability of labeling, one can use RFC-1522-style labeling
with an added "8" encoding for a raw byte stream (alongside "Q" and "B",
defined in RFC-1522), declaring backslash the universal escape character.
That solves the "multiple encodings in the string" problem instantly (like
"text in ASCII, =?koi8-r?8?text in КОИ-8?= and =?iso8859-1?8?text in
iso8859-1?="). That creates far fewer problems, is compatible with
everything in existence for texts that use a single encoding per document,
is easy to implement, and is consistent with MIME and everything
MIME-based or MIME-using. And, of course, it requires absolutely no
changes in anyone's kernel.
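
To make the labeling scheme above concrete, here is a minimal Python
sketch of a reader for it. The "8" encoding is the proposal made in this
message, not part of RFC-1522, and the exact escape rule (a backslash
makes the following byte literal, both inside and outside a label) is an
assumption on my part. The names split_labeled and _parse_label are
hypothetical.

```python
def _parse_label(raw: bytes, start: int):
    """Parse '=?charset?8?payload?=' beginning at raw[start].

    Returns (charset, payload_bytes, next_index), or None if the bytes at
    'start' are not a well-formed "8"-encoded label (assumed syntax).
    """
    cs_end = raw.find(b"?", start + 2)
    if cs_end == -1 or raw[cs_end:cs_end + 3] != b"?8?":
        return None
    try:
        charset = raw[start + 2:cs_end].decode("ascii")
    except UnicodeDecodeError:
        return None
    payload = bytearray()
    i = cs_end + 3
    while i < len(raw):
        if raw[i:i + 1] == b"\\" and i + 1 < len(raw):
            payload.append(raw[i + 1])   # escaped byte, taken literally
            i += 2
        elif raw.startswith(b"?=", i):
            return charset, bytes(payload), i + 2
        else:
            payload.append(raw[i])
            i += 1
    return None                          # unterminated label


def split_labeled(raw: bytes, default: str = "ascii"):
    """Split a byte string into (charset, decoded_text) runs.

    Unlabeled bytes are decoded with 'default', so a document that uses a
    single encoding and no labels comes back as one run.
    """
    runs = []
    plain = bytearray()

    def flush_plain():
        if plain:
            runs.append((default, plain.decode(default)))
            plain.clear()

    i = 0
    while i < len(raw):
        if raw[i:i + 1] == b"\\" and i + 1 < len(raw):
            plain.append(raw[i + 1])     # escaped byte outside any label
            i += 2
            continue
        if raw.startswith(b"=?", i):
            parsed = _parse_label(raw, i)
            if parsed is not None:
                charset, payload, i = parsed
                flush_plain()
                runs.append((charset, payload.decode(charset)))
                continue
        plain.append(raw[i])
        i += 1
    flush_plain()
    return runs


# Usage: one string mixing three encodings, as in the example above.
mixed = (b"text in ASCII, =?koi8-r?8?" + "текст".encode("koi8-r")
         + b"?= and =?iso8859-1?8?caf\xe9?=")
for charset, text in split_labeled(mixed):
    print(charset, repr(text))
```

Each run keeps its charset label next to the decoded text, so a later
stage could choose a font or a language-dependent routine per run instead
of guessing.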

> Most people bashing Unicode seem to do so for two extremely moronic
> reasons:
>
> * It's not a simple 8 bit character set. This is work!
>
> Well duh, there's more than 256 characters around. You _can't_ do this
> with a simple 8 bit character set. And if you want to see a really ugly
> solution, look at ISO 2022 - the "just give up" solution.

Don't put your words in my mouth. 8-bit character sets exist along with
multibyte ones, and both work. In the same applications, too. Unicode
"simplifies" that by creating one multibyte encoding that does *NOT*
provide the features that any other encoding does, so regardless of
whether the original encoding was single-byte or multibyte, the change to
Unicode decreases the quality of support for all languages.

Hint: the quality of support for ASCII or iso8859-1 languages does not
decrease with Unicode, so you can't see it.

> * It doesn't solve problem X that no other character set solves either.
>
> Well duh. Quelle surprise. What else is new?

Not really. It requires an amount of resources that, with proper
standardization, could be enough for real internationalized software,
and _still_ solves nothing except the "pretty foreign letters in a
document" problem.

--
Alex
