Re: unicode (char as abstract data type)

Martin Mares (mj@atrey.karlin.mff.cuni.cz)
Sat, 18 Apr 1998 00:33:34 +0200


Hi,

> 1. While doing absolutely nothing for real language-dependent operations
> on text without clumsy hacks around it, it makes things seem
> "internationalized" when they are not actually usable for non-iso8859-1
> languages. However it creates more trouble for a programmer than any
> multibyte encoding for anything except the most trivial things.

I don't think it does. In many programs you rarely need to touch
individual characters in a string. As an example, it took about 5 hours to
modify the roughly 25000 lines of source of my universal full-text search
engine to fully handle UTF-8, given that I had already written a library
for Unicode handling.
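
To illustrate the point (a sketch only, not code from the search engine
mentioned above): substring search can stay byte-oriented, because a UTF-8
encoded needle always starts with a non-continuation byte and so can only
match at a character boundary. The sample text and the helper names
find_bytes and is_char_boundary are made up for this example.

```python
# Sketch: byte-oriented search on UTF-8 data, with no per-character decoding.
# The sample strings and helper names are assumptions for illustration.

haystack = "Příliš žluťoučký kůň úpěl ďábelské ódy".encode("utf-8")
needle = "kůň".encode("utf-8")


def find_bytes(data: bytes, pattern: bytes) -> int:
    """Return the byte offset of pattern in data, or -1 if absent."""
    return data.find(pattern)


def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not land on a UTF-8 continuation byte (10xxxxxx)."""
    return offset >= len(data) or (data[offset] & 0xC0) != 0x80


pos = find_bytes(haystack, needle)
print("byte offset:", pos)
print("on a character boundary:", is_char_boundary(haystack, pos))

# Pure ASCII text is byte-for-byte identical in UTF-8.
print("abc".encode("utf-8") == b"abc")
```

Only code that really indexes or counts individual characters needs to
know about multibyte sequences; code that just copies, compares or searches
strings can treat them as opaque bytes.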

> 2. It's based on Unicode, the standard, widely opposed everywhere except
> English-speaking countries (whose opinion doesn't count, especially on
> UTF-8 that is binary-indistinguishable from ASCII in ASCII characters
> range) and Western Europe (for what Unicode is specifically accommodated).
> For example, all Russian programmers (and me among them) that I have seen
> or heard, consider that "standardization" as an equivalent of spitting
> into their face.

I don't know the exact situation in Russia, but in the Czech Republic we
probably have even worse problems, as there are more than five widely used
character sets for encoding Czech accented characters. I consider UTF-8 to
be the best solution I've ever seen: if you use any single-byte eight-bit
encoding, you will probably be missing some characters you need; if you
decide to use 16-bit characters, you inflate your files a lot.
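
The trade-off can be made concrete with a small sketch. The sample strings
and the helper names encoded_sizes and fits are invented for this example;
ISO-8859-2 stands in for "a single-byte encoding" and UTF-16 for "16-bit
characters".

```python
# Sketch: compare storage cost and coverage of a few encodings.
# Sample strings are hypothetical, chosen only for illustration.

CZECH = "Zlatý kůň běžel přes most."
MIXED = "kůň и конь"  # Czech and Russian in one string


def encoded_sizes(text):
    """Byte length of text in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def fits(text, codec):
    """True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


sizes = encoded_sizes(CZECH)
print(sizes)
print("Czech fits ISO-8859-2:", fits(CZECH, "iso8859_2"))
print("Mixed fits ISO-8859-2:", fits(MIXED, "iso8859_2"))
print("Mixed fits UTF-8:", fits(MIXED, "utf-8"))
```

In this sample, UTF-8 spends one byte on each ASCII character and two on
each accented Czech letter, while UTF-16 spends two bytes on every
character; the single-byte set is compact but cannot hold the mixed string
at all.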

> 3. It tries to avoid the unavoidable -- multilingual text processing
> must use some kind of charset _and_ _language_ labeling to do things well and
> consistent with complex and diverse nature of human languages. While
> labeling is obviously quite a pain in itself, it's 1. can be easily
> extended, 2. can use existing localizable or localized software, 3. used
> and standardized in MIME, even though in a way that needs to be extended
> to be applicable for documents that contain multiple languages, 4. With
> reasonable effort that does not involve modification of existing software
> and configurations can interoperate with everything that exists now if
> such interoperation is possible at all, 5. Is necessary to
> non-text-display-oriented processing (phonetic match, speech generation,
> statistical text analysis) and even high-quality multilingual
> typesetting.
>
> Unicode just sweeps the dust under the carpet, pretending that the
> problem is limited to the storage and visual representation of
> multilingual text.

No, it just solves the storage and visual representation part of the
problem and leaves the rest to the others.

Have a nice fortnight

-- 
Martin `MJ' Mares   <mj@ucw.cz>   http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"God is real, unless declared integer."
