Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Peter Holzer (hjp@wsr.ac.at)
Wed, 20 Aug 1997 12:43:28 +0200 (MESZ)


Alex Belits wrote:
>On Wed, 20 Aug 1997, Erik Corry wrote:
>
>> Unicode is regularly extended, and is incredibly complete in
>
>...by a committee.

Yes, of course. By whom else?

>And they don't release free implementation of
>it or updates to existing ones after that.

This is a problem. And the last time I heard, the standard was only
available on paper, which is not the best format for something which
consists almost completely of tables.

>> What
>> more could you want?
>
>Japanese and Chinese characters encoding that Japanese and Chinese people
>use, perhaps?

Unicode does include Japanese and Chinese characters. Some may be
missing, of course, but they can (and should) be added.

>> Linux has already standardised on UTF-8 for the console.
>
>(looking at the console...) No, still looks like koi8-r for me... Having
>the internal support doesn't mean that it's usable enough to make it
>mandatory everywhere.

Same here. At least on 2.0.30 (I don't have any 2.1.x kernel running at
the moment) the console is straight Latin-1, not UTF-8: pressing the "ö"
key gives me the single code F6, not C3 B6, and printing C3 B6 to the
console prints "Ã¶", not "ö". The escape sequences in unicode.txt don't
switch to Unicode, either.
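
The byte values involved can be checked with a few lines of Python. This
is only a sketch of the encodings themselves, not of what any particular
kernel console does; the variable names are made up for illustration.

```python
# Sketch: the same character "ö" (U+00F6) under Latin-1 and under UTF-8.
o_umlaut = "\u00f6"

latin1_bytes = o_umlaut.encode("latin-1")  # one byte
utf8_bytes = o_umlaut.encode("utf-8")      # two bytes

print(latin1_bytes.hex())  # f6
print(utf8_bytes.hex())    # c3b6

# A Latin-1 console shows each byte of the UTF-8 sequence as its own
# character, which is why the two bytes come out as "Ã¶" instead of "ö".
as_seen_on_latin1_console = utf8_bytes.decode("latin-1")
```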

>
>> The
>> suggestion of converting all file systems to a single
>> encoding is probably a useful one, and should probably be
>> available as a (default?) mount option.
>
> It should be possible to _choose_ mapping as the mount option, not
>"UTF-8 or all filenames will be truncated to the first letter because
>second one is zero".

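You are mixing up 16-bit Unicode and UTF-8 here. Only the 16-bit
encoding puts a zero byte into every ASCII character. In UTF-8, Unicode
characters 0000 to 007F are mapped to single bytes with the same value.
All other codes are mapped to multi-byte sequences where all bytes have
the MSB set, so a zero byte can only ever mean character 0000.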
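
A minimal Python sketch of the difference, using the standard codecs.
The helper utf8_bytes and the sample name "foo" are assumptions chosen
for illustration, and "utf-16-le" stands in for "16-bit Unicode" in
little-endian byte order.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")

# 0000..007F: one byte, same value as the code point.
for cp in (0x00, 0x41, 0x7F):
    assert utf8_bytes(cp) == bytes([cp])

# Everything else: several bytes, each with the most significant bit set.
for cp in (0x80, 0xF6, 0x20AC, 0x65E5):
    encoded = utf8_bytes(cp)
    assert len(encoded) > 1
    assert all(byte & 0x80 for byte in encoded)

# The "truncated to the first letter" problem belongs to the 16-bit form.
name = "foo"
utf16 = name.encode("utf-16-le")  # b'f\x00o\x00o\x00'
utf8 = name.encode("utf-8")       # b'foo'

# What code that stops at the first zero byte would see of each form.
c_string_view_utf16 = utf16.split(b"\x00", 1)[0]  # b'f'
c_string_view_utf8 = utf8.split(b"\x00", 1)[0]    # b'foo'
```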

>I'm not aware of any development of Unicode-using tools. And unless
>sh / bash / grep / awk /... work with UTF-8 as with native characters
>(that is, a variable-length-encoded character is treated as one
>character, which I don't think anyone will do any time soon), no one
>will use it for anything decent.

The good news about UTF-8 is that most things will "just work". The bad
news is that quite a lot of programs must be fixed to work properly in
all cases. For example, "grep ä foo" will find exactly the lines
containing one or more "ä", even though that character is represented by
two bytes. Similarly, "egrep 'ä|ö|ü' foo" will find exactly the lines
containing "ä", "ö" or "ü". But "grep '[äöü]' foo" will not: unless grep
(or rather regex in libc) knows about UTF-8, it treats the bracket
expression as a set of single bytes and also matches lines containing a
lot of other characters. Similarly, all programs which count characters
(wc, less, vi, ...) must be adapted to handle multibyte characters. But
this is true for all character sets with more than 256 characters.
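
The effect is easy to reproduce with a byte-oriented regex. The sketch
below uses Python's re module on bytes as a stand-in for a grep that
knows nothing about UTF-8; byte_grep and the sample lines are made up
for illustration and are not how any real grep is implemented.

```python
import re

lines = ["Bär", "Öl", "café", "plain"]
encoded = [line.encode("utf-8") for line in lines]

def byte_grep(pattern: str, haystack: list) -> list:
    """Match a UTF-8 encoded pattern byte by byte, like a non-UTF-8 grep."""
    rx = re.compile(pattern.encode("utf-8"))
    return [h.decode("utf-8") for h in haystack if rx.search(h)]

# A literal and an alternation are plain byte sequences, so they work.
literal = byte_grep("ä", encoded)            # only "Bär"
alternation = byte_grep("(ä|ö|ü)", encoded)  # only "Bär"

# The bracket expression becomes the byte set {C3, A4, B6, BC}. Every
# character in U+00C0..U+00FF starts with C3, so "Öl" and "café" match too.
bracket = byte_grep("[äöü]", encoded)

# Counting has the same problem: bytes and characters differ.
byte_count = len("Bär".encode("utf-8"))  # 4
char_count = len("Bär")                  # 3
```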

hp

--
   _  | Peter J. Holzer             | If I were God, or better yet
|_|_) | Sysadmin WSR                | Linus, I would ...
| |   | hjp@wsr.ac.at               |     -- Bill Davidsen
__/   | http://wsrx.wsr.ac.at/~hjp/ |        (davidsen@tmr.com)