Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Darin Johnson (darin@connectnet.com)
Wed, 27 Aug 1997 12:11:29 -0700 (PDT)


Michael Poole writes:
> - Filenames should contain an integral number of characters, even if an
> app tries to write a filename where the NAME_MAX-1'th byte isn't the last
> byte in the wide character. This is debatable --

Hmm, it does seem somewhat valid, given that the user-space
application doesn't know how large a file name can be.
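
To make the "integral number of characters" idea concrete, here is a minimal sketch of truncating a name at a byte limit without splitting a character. It assumes the name is already valid UTF-8; the function name truncate_utf8 and the limit parameter are made up for illustration and are not anything in the kernel.

```python
def truncate_utf8(name: bytes, limit: int) -> bytes:
    """Cut name to at most `limit` bytes without splitting a UTF-8 character.

    Assumes `name` is well-formed UTF-8 (hypothetical helper, not a kernel API).
    """
    if len(name) <= limit:
        return name
    end = limit
    # Continuation bytes look like 10xxxxxx. If the byte just past the cut is
    # one, the cut falls inside a character, so back up to that character's
    # first byte and drop the whole character.
    while end > 0 and (name[end] & 0xC0) == 0x80:
        end -= 1
    return name[:end]
```

With the nine-byte name b"na\xc3\xafve.txt", a limit of 3 would split the two-byte character, so the sketch returns b"na"; a limit of 4 keeps it whole and returns b"na\xc3\xaf".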

The drawback, of course, is that you must choose one and only one
encoding. Some people feel strongly that no single encoding is
adequate :-)

Why not hide all this inside a module? That way you don't force
people to use Unicode; they can plug in a different module if they
like (or a stub module for people who don't care).
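
Roughly what I have in mind, as a user-space sketch rather than real kernel code: a table of interchangeable checkers, one of which is a stub that accepts everything. All the names here (CHECKERS, codec_checker, accept_filename) are invented for the example, and it leans on Python's own codecs to decide what counts as a whole character.

```python
from typing import Callable, Dict

Checker = Callable[[bytes], bool]


def stub_checker(name: bytes) -> bool:
    """Stub module: any byte string is acceptable."""
    return True


def codec_checker(codec: str) -> Checker:
    """Build a checker that accepts only names that decode cleanly in `codec`."""
    def check(name: bytes) -> bool:
        try:
            name.decode(codec)
        except UnicodeDecodeError:
            return False
        return True
    return check


# Hypothetical "module" table; which entries exist is a local policy choice.
CHECKERS: Dict[str, Checker] = {
    "none": stub_checker,
    "utf-8": codec_checker("utf-8"),
    "euc-jp": codec_checker("euc_jp"),
}


def accept_filename(name: bytes, module: str = "none") -> bool:
    """Ask the selected module whether the name holds only whole characters."""
    return CHECKERS[module](name)
```

The point of the design is that the caller never learns which encoding is in force; swapping the module changes the policy without touching anything else.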

> - Console input is something that the kernel must know about (and which I
> know relatively little about); having translation tables between scan
> codes and what gets sent to an app would provide a flexible way to handle
> arbitrary input.

Except this is already handled today by X11 applications. And
undoubtedly it can be handled for console apps too, at the user-space
level. Use a pseudo-tty, and have the console apps read from that
after the translation's been done.

The big problem is that if you stick this in the kernel, then it
becomes difficult for the user to modify it. Some input methods are
very complex; most of the Asian language input methods require
database support. Putting all this into the kernel is excessive, and
isn't necessary. For European languages, you can just have an
appropriate local keymap, for those times when the console must be
used (e.g., X11 is broken, you're upgrading the system, etc.).

At the most, the kernel should just handle a few more keys (to handle
unusual keyboards), and maybe allow multibyte output for complex
keypresses (i.e., a compose sequence could result in a multibyte
character). But that wouldn't require specific knowledge of charsets
in the kernel; only the keymap writer would need to know that.
Everything else can be user-space.
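
As a toy illustration of that split (not how the real console keymaps are laid out): the table maps key sequences to opaque byte strings, and the translation loop never looks inside those bytes. The key names, the KEYMAP table, and translate() are all made up for the example; the choice of UTF-8 for the composed character belongs to whoever wrote the table.

```python
# Hypothetical keymap: sequences of key names -> bytes to hand to the app.
KEYMAP = {
    ("a",): b"a",
    ("e",): b"e",
    # The keymap writer chose UTF-8 for e-acute; the loop below doesn't care.
    ("compose", "'", "e"): b"\xc3\xa9",
}


def translate(keys, keymap):
    """Turn a stream of key names into output bytes using only table lookups."""
    out = bytearray()
    pending = []
    for key in keys:
        pending.append(key)
        seq = tuple(pending)
        if seq in keymap:
            out += keymap[seq]          # emit whatever bytes the table says
            pending.clear()
        elif not any(k[:len(seq)] == seq for k in keymap):
            pending.clear()             # no entry starts this way; drop it
    return bytes(out)
```

Feeding it a, compose, ', e, e yields b"a\xc3\xa9e": one multibyte character in the middle, produced without the loop knowing anything about charsets.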

> - Console output is something else that the kernel must know about. My
> first-hand experience here is only with what currently runs on Wintel
> genre machines, but I think that remappable character bitmaps is about as
> good as we can get for the Wintel text modes.

I don't know how NT does it, but it does print Japanese when booting
up (as part of the "choose an OS" menu). I don't know whether it uses
the BIOS to do this or something else, though (there's not enough of
the OS loaded to do Unicode either).

However, should the kernel handle this again? What if the only
support needed is a way to specify an arbitrary glyph?

The whole point here is: make the kernel flexible, and don't force it
into a single charset unless it's required (especially one that
many/most people don't like, and no one uses; even NT doesn't use it
that well, despite the marketing). Do you really want the *kernel*
to handle 20,000 character fonts (more if you're using a native charset)?

> ext2 currently supports UTF-8 as an encoding; unless you have
> characters outside the ASCII range, it's also using UTF-8, since it was
> designed to preserve that range. For filenames, as long as we don't want
> the kernel to ensure that only an integral number of variable-width
> characters are stored, I agree with you: the kernel doesn't need to know
> about the external encoding, and shouldn't know about it.

Ext2 *supports* UTF-8, but it doesn't assume it and doesn't require
it. That is, if you pass a UTF-8 string (of small enough size), ext2
just transparently handles it. That's different from saying it uses
UTF-8.

> However, my personal belief is that there should be a policy in
> the kernel to only allow whole characters to be stored; in this case the
> kernel will need to know what encoding is used for file names.

I somewhat agree here (I hadn't considered this point). But as said
earlier, perhaps a module could handle it. All that's really needed
is a simple regexp (Unicode, SJIS, and EUC are handled trivially
with a regexp; JIS is a bit trickier because of the shift-out
sequence, and I don't know what BIG5 does).
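
A rough sketch of the kind of regexp I mean, for UTF-8, EUC-JP, and Shift-JIS. The byte ranges are the commonly cited ones and the checks are purely structural (the UTF-8 pattern, for instance, doesn't reject overlong forms), so treat this as an illustration under those assumptions; PATTERNS and is_whole are names invented here.

```python
import re

PATTERNS = {
    "utf-8": re.compile(
        rb"(?:[\x00-\x7f]"                 # ASCII
        rb"|[\xc2-\xdf][\x80-\xbf]"        # 2-byte sequence
        rb"|[\xe0-\xef][\x80-\xbf]{2}"     # 3-byte sequence
        rb"|[\xf0-\xf4][\x80-\xbf]{3})*"   # 4-byte sequence
    ),
    "euc-jp": re.compile(
        rb"(?:[\x00-\x7f]"                 # ASCII
        rb"|\x8e[\xa1-\xdf]"               # SS2 + half-width katakana
        rb"|\x8f[\xa1-\xfe]{2}"            # SS3 + two-byte supplementary set
        rb"|[\xa1-\xfe]{2})*"              # ordinary two-byte character
    ),
    "sjis": re.compile(
        rb"(?:[\x00-\x7f]"                 # single byte, ASCII range
        rb"|[\xa1-\xdf]"                   # single-byte half-width katakana
        rb"|[\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])*"  # lead + trail byte
    ),
}


def is_whole(name: bytes, encoding: str) -> bool:
    """True if `name` is made up only of complete characters in `encoding`."""
    return PATTERNS[encoding].fullmatch(name) is not None
```

A name chopped in the middle of a character fails the match in all three cases, because the final lead byte has no trail byte to pair with.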

Libc should be able to handle this as well (at least when the name is
read back, since it can detect a char that's been chopped up).
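
For what it's worth, here is a small user-space sketch of that detection, for UTF-8 only. It uses Python's incremental decoder as a stand-in for whatever libc would do; chopped_tail is a hypothetical helper name, and it assumes everything before the tail is valid (otherwise the decoder raises UnicodeDecodeError).

```python
import codecs


def chopped_tail(name: bytes) -> bytes:
    """Return the bytes of an incomplete UTF-8 character at the end of `name`.

    An empty result means the name ends on a character boundary.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    # final=False tells the decoder more input may follow, so a partial
    # character at the end is buffered instead of being reported as an error.
    decoder.decode(name, final=False)
    pending, _ = decoder.getstate()
    return pending
```

For b"na\xc3" this returns b"\xc3", the lead byte left without its continuation; for b"na\xc3\xaf" it returns b"".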