Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Michael Poole (poole+@andrew.cmu.edu)
Wed, 27 Aug 1997 01:07:13 -0400 (EDT)


On Tue, 26 Aug 1997, Darin Johnson wrote:

> > From: Michael Poole <poole+@andrew.cmu.edu>
>
> > In the kernel, I think the decision that needs to be made rests on
> > these points:
>
> You forgot an important question:
> - Does the kernel even need to concern itself with character encodings?

In my view, yes, for these reasons:
- Filenames should contain an integral number of characters, even if an
app tries to write a filename where the byte at offset NAME_MAX-1 isn't
the final byte of its multi-byte character (a sketch of such a clip
appears after this list). This is debatable -- one can argue all day
about whether that constitutes stupidity or misbehavior on the
application's part -- but in the end I think it will boil down to
standards support or 'executive decision'.

- Console input is something that the kernel must know about (and which
I know relatively little about); translation tables mapping scan codes
to the bytes delivered to an app would provide a flexible way to handle
arbitrary input. That scheme might make control- (or meta- or alt-)
keys harder to handle, or break other assumptions in the kernel.

- Console output is something else that the kernel must know about. My
first-hand experience here is only with what currently runs on
Wintel-class machines, but I think that remappable character bitmaps
are about as good as we can get for the Wintel text modes. For
graphical consoles, there are other issues. As I mentioned, I think a
lot of users will (fairly) expect the OS to support display and
handling of multilingual (or at least wide) character sets. I have not
thought about this in great depth, but I believe this is something the
kernel should mediate (particularly if variable-width character
encodings are used); it's possible that a working solution can (and
therefore probably should) run in user-space.
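
To make the first point concrete, here is a minimal sketch (user-space
C, not kernel code, and utf8_clip is just a name I made up) of what
'an integral number of characters' means when a name has to be cut at
a byte limit such as NAME_MAX. It relies only on the UTF-8 rule that
continuation bytes look like 10xxxxxx:

    #include <stddef.h>

    /* Clip a UTF-8 name to at most 'max' bytes without leaving a
     * trailing fragment of a multi-byte character.  Assumes the
     * input is valid UTF-8. */
    static size_t utf8_clip(const unsigned char *name, size_t len,
                            size_t max)
    {
        if (len <= max)
            return len;
        /* Back up while the first discarded byte would be a
         * continuation byte (10xxxxxx), i.e. while the cut would
         * fall inside a character. */
        while (max > 0 && (name[max] & 0xc0) == 0x80)
            max--;
        return max;
    }

Whether a clip like that (or an outright ENAMETOOLONG) happens in the
kernel or in libc is exactly the question at issue.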

[snip]
> The kernel doesn't need to know what charset the filenames are in, it
> just needs to leave them alone. And this is a GOOD thing. If one
> Linux distribution decides to use Unicode, it can. If another
> distribution wants to use SJIS, they can. If another distribution
> comes up with a way of handling multiple charsets via escape
> sequences, so much the better. All camps should be happy here.
>
> If suddenly it is declared "EXT2 uses UTF-8", how is it going to
> accomplish this? It surely won't go and translate anything coming
> from user-space, because it doesn't know what encoding those
> characters are. No, instead, such a proclamation would result in zero
> code changes. The kernel should just accept what it is handed,
> because it can't know enough to convert to/from any official encoding
> anyway without user-space help.

ext2 already supports UTF-8 as an encoding; in fact, unless your
filenames contain characters outside the ASCII range, they already are
UTF-8, since UTF-8 was designed to preserve that range. For filenames,
as long as we don't want the kernel to ensure that only an integral
number of variable-width characters is stored, I agree with you: the
kernel doesn't need to know about the external encoding, and shouldn't
know about it. However, my personal belief is that there should be a
policy in the kernel to allow only whole characters to be stored; in
that case the kernel will need to know what encoding is used for file
names. I strongly suspect that this won't be implemented, though,
either for the sake of standards compliance or for the benefit of
supporting multiple encodings. There is a very strong case to be made
that character delineation should be left to user-space, and if that's
what prevails, so be it; I think that libc should be able to implement
the policy just as well as the kernel.
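
Whichever layer ends up owning that policy, the check itself is small.
A sketch (the function name is mine; the 5- and 6-byte forms are the
ones in the original UTF-8 definition) that accepts a name only if it
is a sequence of complete UTF-8 characters:

    #include <stddef.h>

    /* Return nonzero iff 'name' consists only of whole UTF-8
     * characters: every lead byte is followed by exactly the
     * number of continuation bytes it promises.  (Checks for
     * overlong forms and the like are omitted.) */
    static int utf8_whole(const unsigned char *name, size_t len)
    {
        size_t i = 0, j, need;

        while (i < len) {
            if (name[i] < 0x80)                need = 0; /* ASCII */
            else if ((name[i] & 0xe0) == 0xc0) need = 1;
            else if ((name[i] & 0xf0) == 0xe0) need = 2;
            else if ((name[i] & 0xf8) == 0xf0) need = 3;
            else if ((name[i] & 0xfc) == 0xf8) need = 4;
            else if ((name[i] & 0xfe) == 0xfc) need = 5;
            else return 0;      /* stray continuation byte */

            if (need > len - 1 - i)
                return 0;       /* truncated final character */
            for (j = 1; j <= need; j++)
                if ((name[i + j] & 0xc0) != 0x80)
                    return 0;   /* malformed sequence */
            i += need + 1;
        }
        return 1;
    }

The name-creating paths (creat, mkdir, rename, ...) could then refuse,
with EINVAL say, anything for which this returns zero.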

[snip]
> On the other hand, even though none of these needs to be in the
> kernel, and can all be done in user-space, one other question remains.
> What *can* be put into the kernel that would be useful and appropriate?

That's a good question. The character delineation I hit on above
is the only example I can think of, although others might exist.

> > - Input is another issue, but I don't feel qualified to comment on it;
> > I don't have any idea how it's currently handled or how
> > foreign-language input methods generally (or "should") work.
>
> These can be entirely userspace. Output might not be, because the
> kernel does do output. But there is no direct input to the kernel.
> (well, there are lilo command lines, but I doubt anyone is going to put
> input methods into lilo :-)

On reconsideration, I think you're right. My premise that input might
need kernel-level interpretation hinged on people using keyboards which
didn't show (or didn't support) the Roman alphabet as a basis for
generating native characters; however, I think that for as long as the
keyboard remains a viable input device, most keyboards will support the
Roman alphabet.

> > Here are my
> > arguments on why we need something like Unicode or UTF-8 support in the
> > kernel, as a list of the features required:
> > * Unambiguous encodings of distinct characters within a language
>
> Unambiguous encodings of a *subset* of distinct characters within a language :-)
> (unicode has 20K different "Han" characters, which leaves a lot of
> dictionaries out in the cold)
>
> But you don't say why this is needed in the *kernel*.

Sorry: this is to allow sane user-space implementations of string
handling. This isn't something that's necessary in the kernel, per se,
but rather a trait which I think the character set should have. Perhaps
my 'feature' doesn't quite make sense, but mostly I was arguing for an
encoding where you don't need to consider surrounding characters when
deciding whether two characters are equivalent. I don't know of any
character encoding which violates this idea, but it might be useful to
keep in mind ;).
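
As a sketch of what the property buys (same_name is just an
illustrative name): if every character has exactly one,
context-independent byte sequence -- as in UTF-8, invalid overlong
forms aside -- then 'same characters' reduces to 'same bytes', and no
decoder is needed at all:

    #include <string.h>

    /* Valid only because each character has a unique, position-
     * independent encoding; in a stateful encoding, two byte
     * strings can differ (e.g. via redundant shift sequences)
     * while spelling identical characters. */
    int same_name(const char *a, const char *b)
    {
        return strcmp(a, b) == 0;
    }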

> > * Relatively easy to find the beginning and end of characters (not loads
> > of state), since it's bad to store fractional characters, e.g. in a filename
>
> True. But for most native encodings, this is also true (especially if
> all you look for are "/" or "\0").

Correct: for most native encodings, this is true. I seem to remember
one Japanese encoding that requires state persisting across the whole
string (rather than just within a multi-byte character) to detect where
characters begin and end, but I can't find a reference to this. The
'/' might end up being a non-initial byte of a multi-byte sequence, and
the '\0' wouldn't help in all situations (since the example I gave --
storing a fractional character in a file name -- is most likely to come
up when the filename is as long as it can be).
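
(The encoding I was thinking of may well have been ISO-2022-JP, where
ESC $ B shifts into a two-byte mode whose bytes all fall in the range
0x21-0x7e -- so 0x2f, '/', can be the second byte of a kanji -- and
ESC ( B shifts back to ASCII.) As a sketch of what scanning for '/'
costs in such an encoding -- the function name is mine, and the escape
handling is simplified to the 3-byte designations:

    #include <stddef.h>

    /* Deciding whether the byte at 'pos' is really '/' means
     * replaying the shift state from the start of the string. */
    static int is_real_slash(const unsigned char *s, size_t len,
                             size_t pos)
    {
        int twobyte = 0;   /* 0 = ASCII mode, 1 = two-byte mode */
        size_t i = 0;

        while (i <= pos && i < len) {
            if (s[i] == 0x1b && i + 2 < len && s[i + 1] == '$') {
                twobyte = 1;           /* e.g. ESC $ B */
                i += 3;
            } else if (s[i] == 0x1b && i + 2 < len && s[i + 1] == '(') {
                twobyte = 0;           /* e.g. ESC ( B */
                i += 3;
            } else if (i == pos) {
                return !twobyte && s[i] == '/';
            } else {
                i += twobyte ? 2 : 1;  /* skip one character */
            }
        }
        return 0;  /* pos fell inside an escape or a character */
    }

UTF-8 avoids the whole exercise: every byte of a multi-byte character
is 0x80 or above, so a byte equal to 0x2f (or 0x00) is always a real
'/' (or NUL), with no state at all.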

[snip]
> > From: "Svein Erik Brostigen" <SveinErik.Brostigen@ksr.okpost.telemax.no>
>
> > I, for one, would love to be able to have both Japanese, Korean,
> > Thai and Norwegian characters on the screen at the same time and
> > without any tricky stuff to make this possible.
>
> Ironically, I can do this in MULE, which isn't Unicode, but I can NOT
> do this in Windows NT, which is Unicode.

For contrast: I've had both English and Korean, and English and
Croatian, on a Windows NT display simultaneously. It works reasonably
well, although finding a font which supports all the characters you
want is one problem, and multilingual input is another. (But never
fear -- as always, Microsoft says this will be fixed in the next
version. Right. :)

Michael