Re: unicode (char as abstract data type)

Alex Belits (abelits@phobos.illtel.denver.co.us)
Fri, 17 Apr 1998 16:00:15 -0700 (PDT)


On Sat, 18 Apr 1998, Martin Mares wrote:

> Hi,
>
> > 1. While doing absolutely nothing for real language-dependent operations
> > on text without clumsy hacks around it, it makes things seem
> > "internationalized" when they are not actually usable for non-iso8859-1
> > languages. However it creates more trouble for a programmer than any
> > multibyte encoding for anything except the most trivial things.
>
> I think it doesn't. In many programs, you rarely need to touch individual
> characters in a string (as an example, about 25000 lines of source of my
> universal full-text search engine were modified to fully handle UTF-8 in
> about 5 hours, having already written a library for UniCode handling).

Yeah, full-text search will be just perfect. But just try to analyze
that text if it's really multilingual.

> > 2. It's based on Unicode, a standard widely opposed everywhere except
> > English-speaking countries (whose opinion doesn't count, especially on
> > UTF-8, which is binary-indistinguishable from ASCII in the ASCII character
> > range) and Western Europe (which Unicode specifically accommodates).
> > For example, all Russian programmers (and me among them) that I have seen
> > or heard consider that "standardization" the equivalent of spitting
> > in their faces.
>
> I don't know exact situation in Russia, but in Czech Republic we probably
> have even worse problems as there are more than five widely-used character
> sets used to encode Czech accented characters and I consider UTF-8 to be
> the best solution I've ever seen (if you use any single-character eight bit
> encoding, you've probably missed some characters you need; if you decide
> to use 16-bit characters, you inflate your files a lot).

The Russian alphabet has 33 letters, each with an uppercase and a
lowercase form, and none of them is in ASCII; however, there is a commonly
accepted semi-phonetic match between ASCII and Cyrillic. The koi8-r charset
takes advantage of that: each Cyrillic letter sits in the upper half of
the table, right above the matching ASCII letter. That makes
Russian-specific text processing easy and efficient to write, and as a
side effect, Cyrillic texts remain readable when passed through something
that strips the high bit (like some sendmail configurations), and Cyrillic
letters don't look like control characters to any "non-internationalized"
software. While those properties are unique to koi8 and the
Russian/Byelorussian/Ukrainian languages, abandoning them for Unicode
definitely doesn't look like a viable option. And despite the existence
of other charsets in Russia, all internet-related communication has been
done in this charset since the very beginning of the first UUCP-based
network there. The only, and sad, exceptions now are web pages posted
through FrontPage -- it enforces "charset=cp1251", and thus breaks every
document that uses anything else.
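
To make the high-bit-stripping property concrete, here is a minimal
sketch in Python. It assumes Python's built-in "koi8_r" codec; the
function name strip_high_bit is made up for this illustration and is
not taken from any mail software. It clears bit 7 of every byte, as a
7-bit-only transport would, and compares koi8-r with UTF-8.

```python
def strip_high_bit(text: str, charset: str) -> bytes:
    """Encode text, then clear bit 7 of every byte (hypothetical helper)."""
    return bytes(b & 0x7F for b in text.encode(charset))

word = "Привет"  # Russian for "hello"

koi8_stripped = strip_high_bit(word, "koi8_r")
utf8_stripped = strip_high_bit(word, "utf-8")

# koi8-r degrades to a case-inverted ASCII transliteration.
print(koi8_stripped.decode("ascii"))             # pRIWET
print(koi8_stripped.decode("ascii").swapcase())  # Priwet

# UTF-8 degrades to bytes that include ASCII control characters.
print(any(b < 0x20 for b in utf8_stripped))      # True
```

The inverted case comes from the table layout: lowercase Cyrillic
letters sit above uppercase ASCII ones, and vice versa.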

> > 3. It tries to avoid the unavoidable -- multilingual text processing
> > must use some kind of charset _and_ _language_ labeling to do things well
> > and consistently with the complex and diverse nature of human languages.
> > While labeling is obviously quite a pain in itself, it 1. can be easily
> > extended, 2. can use existing localizable or localized software, 3. is
> > used and standardized in MIME, even though in a way that needs to be
> > extended to be applicable to documents that contain multiple languages,
> > 4. with reasonable effort that does not involve modification of existing
> > software and configurations, can interoperate with everything that exists
> > now if such interoperation is possible at all, 5. is necessary for
> > non-text-display-oriented processing (phonetic match, speech generation,
> > statistical text analysis) and even high-quality multilingual
> > typesetting.
> >
> > Unicode just sweeps the dust under the carpet, pretending that the
> > problem is limited to the storage and visual representation of
> > multilingual text.
>
> No, it just solves the storage and visual representation part of the
> problem and leaves the rest to the others.

...while all the necessary meta-information about language context is
gone, and there is no way to recover it except by guessing (one charset
may be used for different languages). Visual representation isn't
exactly solved either -- one has to go to great lengths just to determine
the set of fonts necessary to display a document -- no one makes complete
Unicode fonts, so one still has to find all the charsets involved.
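
A rough sketch of what I mean by losing the labels, in Python. The
LabeledRun type, its field names and the flatten function are all
invented for this example; they are not any existing MIME or library
API. Two runs carry identical koi8-r bytes but different language
tags, and converting the document to one Unicode string drops the
tags.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledRun:
    data: bytes    # text encoded in `charset`
    charset: str   # e.g. "koi8_r"
    language: str  # e.g. "ru"; a hypothetical label

def flatten(runs):
    """Decode every run into one Unicode string; the labels are dropped."""
    return "".join(r.data.decode(r.charset) for r in runs)

doc = [
    LabeledRun("мир".encode("koi8_r"), "koi8_r", "ru"),  # Russian word
    LabeledRun(" / ".encode("ascii"), "ascii", "en"),
    LabeledRun("мир".encode("koi8_r"), "koi8_r", "uk"),  # Ukrainian word
]

flat = flatten(doc)

# Same bytes and same charset, yet different languages.
print(doc[0].data == doc[2].data)          # True
print(doc[0].language, doc[2].language)    # ru uk

# After flattening, nothing distinguishes the two words.
print(flat)                                # мир / мир
```

Anything that needs the language (speech generation, phonetic match,
hyphenation) has to guess it back from the flattened string.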

--
Alex
