Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

H. Peter Anvin (hpa@transmeta.com)
27 Aug 1997 16:15:03 GMT


Followup to: <199708271226.VAA10421@megatherium.mri.co.jp>
By author: NIIBE Yutaka <gniibe@mri.co.jp>
In newsgroup: linux.dev.kernel
>
> H. Peter Anvin writes:
> > Trivial. Pick a range out of the *thousands* of private-use planes in
> > UCS-4, and map your character set(s) onto them. Then encode the whole
> > thing in UTF-8. Done.
>
> Yes. But I'm afraid that we discuss other things each other here.
> I hope we could share some ideas and experiences. My point is that
> the needs of handling multiple character sets (simultaneously).
>
> In the naive approach of using private-use planes, some problem can be
> solved, yes, each person can use his/her own character set(s).
> However, speaking of information interchange, we have to send
> information about the character set itself along with text.
> Then, it seems for me that it's multiple character sets system in fact.
>
> Besides, I'm afraid that using UCS-4 in such a way, some people think
> it's abuse of UCS-4. If it's not problem, standardization of handling
> multiple character sets in UCS-4 is the way to go.
>

I guess I don't understand what you are talking about here. All I'm
saying is that if you have a character set which is not supported by
ISO 10646, there is plenty of space in UCS-4 to map it. Basically
you're using the codepoint, say, U+000F0000, to mean "character 0 in
myspiffycharacterset-1". The code point itself then carries the
information of which character set it comes from.
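
As a rough sketch of that mapping (not a definitive implementation):
the names SPIFFY_BASE, SPIFFY_SIZE, spiffy_to_utf8 and utf8_to_spiffy
are made up for illustration, and the 256-character size of
"myspiffycharacterset-1" is an assumption. Only the base U+000F0000
comes from the example above.

```python
# Hypothetical layout: "myspiffycharacterset-1" occupies a run of code
# points starting at U+000F0000. The size is an assumption.
SPIFFY_BASE = 0xF0000
SPIFFY_SIZE = 256


def spiffy_to_utf8(codes):
    """Map character numbers of the private set to code points, then
    encode the resulting string as UTF-8."""
    for c in codes:
        if not 0 <= c < SPIFFY_SIZE:
            raise ValueError("not in myspiffycharacterset-1: %r" % (c,))
    return "".join(chr(SPIFFY_BASE + c) for c in codes).encode("utf-8")


def utf8_to_spiffy(data):
    """Decode UTF-8 and report, per character, which set it came from.
    The code point alone is enough to tell; no side information needed."""
    result = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if SPIFFY_BASE <= cp < SPIFFY_BASE + SPIFFY_SIZE:
            result.append(("spiffy", cp - SPIFFY_BASE))
        else:
            result.append(("ucs", cp))
    return result


# "character 0 in myspiffycharacterset-1" becomes four UTF-8 bytes.
encoded = spiffy_to_utf8([0])          # b'\xf3\xb0\x80\x80'
mixed = utf8_to_spiffy(spiffy_to_utf8([0, 5]) + b"A")
```

Here mixed comes out as [("spiffy", 0), ("spiffy", 5), ("ucs", 65)], so
private characters and ordinary ones can sit in the same string.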

The other alternative is to use a stateful encoding like ISO 2022.
Stateful encodings are particularly bad for short strings like
filenames, and you can't extract substrings from them. Painful.
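
To illustrate the substring problem, here is a small sketch using
Python's "iso2022_jp" codec as one concrete ISO 2022 variant. The
sample text and the slice offsets are assumptions chosen for the demo.

```python
# Three kanji, written as escapes to keep this file ASCII-only.
text = "\u65e5\u672c\u8a9e"

# ISO-2022-JP: an escape sequence switches into JIS X 0208, another
# one switches back to ASCII at the end of the string.
stateful = text.encode("iso2022_jp")
has_escape = stateful.startswith(b"\x1b$B")

# Cut out the two bytes of the first kanji, dropping the escape
# sequence that gave them their meaning. A decoder starting in the
# default (ASCII) state reads them as two ASCII characters instead.
piece = stateful[3:5]
misread = piece.decode("iso2022_jp")

# UTF-8 is stateless: the three bytes of the first kanji decode
# correctly without any surrounding context.
stateless = text.encode("utf-8")
first = stateless[0:3].decode("utf-8")
```

The slice of the stateful string silently turns into different
characters, while the UTF-8 slice still yields the first kanji.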

Note that in general, giving each character set its own code points
is a Bad Thing[TM]. For example, I do not want different code points
for the letter "A" from ISO 8859-1 and ISO 8859-3. Since they are the
same character, unification is a Good Thing[TM].
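
A small sketch of the difference, using Python's "iso8859_1" and
"iso8859_3" codecs. The two private-use bases LATIN1_BASE and
LATIN3_BASE are hypothetical and only show what the non-unified
alternative would look like.

```python
# Unified: byte 0x41 is "A" in both sets and decodes to the same
# code point, U+0041, so the two strings compare equal.
latin1_a = b"\x41".decode("iso8859_1")
latin3_a = b"\x41".decode("iso8859_3")
unified = latin1_a == latin3_a

# Not unified (hypothetical): each set gets its own private range, so
# the "same" letter ends up as two different code points and a plain
# string comparison no longer matches.
LATIN1_BASE = 0xF0000
LATIN3_BASE = 0xF0100
private_latin1_a = chr(LATIN1_BASE + 0x41)
private_latin3_a = chr(LATIN3_BASE + 0x41)
split = private_latin1_a != private_latin3_a
```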

-hpa

-- 
    PGP: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
    See http://www.zytor.com/~hpa/ for web page and full PGP public key
Always looking for a few good BOsFH.  **  Linux - the OS of global cooperation
        I am Baha'i -- ask me about it or see http://www.bahai.org/