Re: [OFFTOPIC] Re: unicode (char as abstract data type)

Ton Hospel (thospel@mail.dma.be)
22 Apr 1998 23:25:16 GMT


In article <Pine.LNX.3.96.980422032357.8432D-100000@phobos.illtel.denver.co.us>,
Alex Belits <abelits@phobos.illtel.denver.co.us> writes:
> On Tue, 21 Apr 1998, Pavel Machek wrote:
>
>> Really, you cannot determine charset from language, because the language
>> of *my* emails is sometimes something between Czech and English. And
>> now imagine me wanting to write the Russian word sabaka (or however dog
>> is written). Of course I want to write it in azbuka. And I do not want
>> to tell my text editor the origin of each word I use.
>
> If it attaches labels to character sequences, it will know them.
> English, being a language supported as the ASCII subset of a non-ASCII
> charset, can be used without separate labeling; but again, if necessary,
> a switch between languages can be reflected in the labeling.
>
>> So it is hard to impossible to gather info about the language. The user
>> just will not want to tell you. The user may even want to write Geek
>> with a 'G' in azbuka. Why not?
>
> Information about language is always lost if there is no place to put a
> label. Like, in Unicode.
>
>> I believe that language labeling cannot handle
>> that.
>
> Why? One can make labels as detailed about the language as necessary --
> labeling assumes an extensible label set, and matching will still work
> because a label contains the charset name. See MIME for an example of
> labeling (not as an example of 7-bit encoding; that is unnecessary).
>
> --
> Alex

Oh please, let it die.

A filename is just a sequence of bytes. Since we like to list files, and
people like to see and use the odd symbols of their favourite character sets,
we want to extend a filename from a sequence of bytes to a sequence of glyphs.
(They may also have filesystems where they think of filenames as sequences of
characters familiar to them, which they want to access from Linux.)
Unicode lets us have more than 256 glyphs available at once, and UTF-8 is just
a convenient encoding that lets us keep using all the old stuff everywhere
(without wasting too many bytes, since much old stuff, e.g. ftp sites, has
names that are plain byte sequences), and under which / and \0 remain the only
special characters.
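
To make that concrete, here is a rough user-space sketch in C (purely an
illustration, not kernel code; the helper name is made up) of how UTF-8
encodes a code point up to U+FFFF: ASCII comes out as itself, and every byte
of a multi-byte sequence has the high bit set, so it can never collide with
/ or \0:

#include <stdio.h>

/* Illustration: encode one Unicode code point (up to U+FFFF) as UTF-8.
 * ASCII encodes as itself; every byte of a multi-byte sequence is
 * >= 0x80, so it can never be mistaken for '/' (0x2F) or '\0'.
 * Returns the number of bytes written into out[]. */
static int utf8_encode(unsigned int cp, unsigned char out[3])
{
        if (cp < 0x80) {                        /* 0xxxxxxx */
                out[0] = cp;
                return 1;
        } else if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
                out[0] = 0xC0 | (cp >> 6);
                out[1] = 0x80 | (cp & 0x3F);
                return 2;
        } else {                                /* 1110xxxx 10xxxxxx 10xxxxxx */
                out[0] = 0xE0 | (cp >> 12);
                out[1] = 0x80 | ((cp >> 6) & 0x3F);
                out[2] = 0x80 | (cp & 0x3F);
                return 3;
        }
}

int main(void)
{
        unsigned char buf[3];
        int i, n;

        n = utf8_encode(0x0431, buf);           /* CYRILLIC SMALL LETTER BE */
        for (i = 0; i < n; i++)
                printf("%02X ", buf[i]);        /* prints: D0 B1 */
        printf("\n");
        return 0;
}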

We had no language tagging before and we have none after; all we gained is an
extension of our usable glyph set.

Think, for example, of how Unix sees files as streams of bytes, even though
record-based filesystems were popular at the time. The idea is that the kernel
should just provide a simple model; your application programs then do things
like interpreting \n as a record separator in text files (while it means no
such thing to a bytecode interpreter working through a byte-compiled program).
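
A minimal sketch of that division of labour, assuming nothing beyond stdio:
the kernel just hands the program bytes, and it is the program that chooses
to treat \n as the record separator:

#include <stdio.h>

/* The kernel delivers a stream of bytes; the application decides that
 * '\n' separates records.  Count the records on standard input. */
int main(void)
{
        int c, records = 0;

        while ((c = getchar()) != EOF)
                if (c == '\n')
                        records++;
        printf("%d records\n", records);
        return 0;
}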

Same thing with internationalization. The applications must decide how they
handle things like language tagging, line breaking, direction of display, etc.
Unicode just enables those programs by giving them a sufficiently rich glyph
set to play with. You are basically trying to push an application's
responsibility into something that is just a byte-sequence <-> glyph-sequence
converter. Sure, we could try to do that, but we shouldn't. Wrong abstraction.
Unicode is an enabler of i18n, not an i18n method.

Discussion of what a good encoding of the glyphs in the kernel is, and of how
we best map filesystems with built-in codepages/character sets to and from our
brand new glyph universe in the kernel, belongs here (a small sketch of such a
mapping follows below).
Discussion of why you think Unicode is a misguided solution to a problem it
was not trying to solve does not.
(Maxim of the day: Unicode is not the solution, but it's also not the problem.)
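
As a sketch of what that codepage mapping might look like (the function name
is made up and only a couple of CP437 entries are filled in; this is not the
kernel's actual NLS interface): a filesystem that stores names in an 8-bit
codepage needs a per-byte table to Unicode code points, plus the reverse
lookup on the way back.

#include <stdio.h>

/* Illustration: map one byte of a CP437-style codepage to a Unicode
 * code point.  A real table covers all of 0x80-0xFF; only two entries
 * are shown here to make the shape of the mapping visible. */
static unsigned short cp_to_unicode(unsigned char b)
{
        if (b < 0x80)                   /* ASCII maps to itself */
                return b;
        switch (b) {
        case 0x84: return 0x00E4;       /* a with diaeresis in CP437 */
        case 0x94: return 0x00F6;       /* o with diaeresis in CP437 */
        default:   return 0xFFFD;       /* REPLACEMENT CHARACTER: unmapped */
        }
}

int main(void)
{
        printf("0x84 -> U+%04X\n", cp_to_unicode(0x84));
        return 0;
}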

-- 
My pid is Inigo Montoya.  You kill -9 my parent process.  Prepare to vi.
