Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

H. Peter Anvin (hpa@transmeta.com)
27 Aug 1997 06:21:00 GMT


Followup to: <Pine.LNX.3.95.970826221538.16339A-100000@phobos.illtel.denver.co.us>
By author: Alex Belits <abelits@phobos.illtel.denver.co.us>
In newsgroup: linux.dev.kernel
>
> ...but since sorting is charset-dependent, I can always apply charset's
> local definition of sorting and case-mapping (or even phonetic matching
> in loose search, or language-dependent word-searching rules in keywords
> searching) if I know the charset. One can even write C++ class to
> handle such things automatically and derive charsets from it (or do it
> in any OO language but Java, or in plain C if one wishes). With bare
> Unicode I simply can't do that unless I convert things back.
>

B*llsh*t. Sorting is not charset dependent, it is *LANGUAGE*
dependent. It is not the same thing.

The Swedish word "smörgåsbord" is borrowed into English. Therefore,
you cannot tell from looking at it, or its encoding, whether or not it
is in English or Swedish. However, in Swedish "ö" sorts after "z",
"å" and "ä", but in English it sorts between "n" and "p".

Furthermore, in Swedish, "ü" sorts like "y", between "x" and "z", but
in German it sorts with "u", between "t" and "v". In Swedish "w" sorts like
"v", between "u" and "x", but in English it sorts separately, between "v"
and "x".

Until recently, in Spanish, the *two* letters "ch" sorted like a
separate letter between "c" and "d".
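
As a rough illustration of the "ch" case, here is a minimal Python sketch.
The function name traditional_spanish_key and the word list are made up for
this example, and it handles only the "ch" rule described above, nothing
else about Spanish collation.

```python
def traditional_spanish_key(word):
    """Hypothetical sort key: "ch" counts as one letter between "c" and "d"."""
    key = []
    w = word.lower()
    i = 0
    while i < len(w):
        if w.startswith("ch", i):
            # ("c", 1) ranks after every plain ("c", 0) and before ("d", 0).
            key.append(("c", 1))
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key


words = ["dado", "chico", "cuna", "cola", "coche", "cocina"]

# Plain code-point order: "ch" is just "c" followed by "h".
print(sorted(words))
# ['chico', 'coche', 'cocina', 'cola', 'cuna', 'dado']

# Traditional order: every "ch" word comes after all other "c" words.
print(sorted(words, key=traditional_spanish_key))
# ['cocina', 'coche', 'cola', 'cuna', 'chico', 'dado']
```

Note that the bytes of the words never change; only the rule applied to
them does.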

You *can't* do this in any existing multilingual charset, be it Latin,
Cyrillic, Greek, Arabic, Han, or IPA.
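
To make that concrete, here is a minimal Python sketch that sorts the same
strings, in the same encoding, under two language rules. The tables and the
names swedish_key and english_key are assumptions made for this example;
they cover only the lowercase letters and rules mentioned above and are not
a real collation implementation.

```python
# Hypothetical, simplified tables: lowercase letters only.
SWEDISH_ORDER = "abcdefghijklmnopqrstuvxyzåäö"   # "w" is folded into "v" below
SWEDISH_ALIASES = {"w": "v", "ü": "y"}
ENGLISH_ALIASES = {"å": "a", "ä": "a", "ö": "o", "ü": "u"}


def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    letters = (SWEDISH_ALIASES.get(ch, ch) for ch in word.lower())
    # Raises ValueError for characters outside the table; fine for a sketch.
    return [SWEDISH_ORDER.index(ch) for ch in letters]


def english_key(word):
    """Treat the accented letters as their base letters."""
    return [ENGLISH_ALIASES.get(ch, ch) for ch in word.lower()]


words = ["zebra", "öl", "orange", "smörgåsbord", "smuts"]

print(sorted(words, key=swedish_key))
# ['orange', 'smuts', 'smörgåsbord', 'zebra', 'öl']

print(sorted(words, key=english_key))
# ['öl', 'orange', 'smörgåsbord', 'smuts', 'zebra']
```

The caller has to say which language it wants; nothing in the strings or
their charset can pick the key function for you.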

-hpa

-- 
    PGP: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
    See http://www.zytor.com/~hpa/ for web page and full PGP public key
Always looking for a few good BOsFH.  **  Linux - the OS of global cooperation
        I am Baha'i -- ask me about it or see http://www.bahai.org/