Kernel & Unicode

Albert Cahalan (acahalan@lynx.dac.neu.edu)
Wed, 27 Aug 1997 06:13:46 -0400 (EDT)


8-bit national encodings (Russian KOI-8...) can work just fine in a
world with a kernel that assumes Unicode. You have several options.

Option A
You can have the kernel translate everything. (required if you
also have Latin-2 and PC-OEM users on the same machine that want
to share filenames with common non-ASCII characters)

Option B
Let the kernel think everything is Latin-1, even though it is
not really. Your console font translation table is 1:1 if you
need a new font, or something strange if you keep the default.
Your "Unicode" is not really Unicode, so avoid NTFS.

Either way it is the same kernel: one that has been told about your
character set, or one left at its defaults.

The kernel already does character set translation in many places.
We ought to just admit it and provide unified translation functions.
Right now several filesystems and the console have large tables
to do the translation. Others _should_ have translation, so that
I can put a Latin-1 floppy in an otherwise Latin-2 machine and
read filenames as well as possible.
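
A minimal sketch of what such unified translation functions might
look like. Every name here (cs_to_ucs, ucs_to_cs, the charset enum)
is hypothetical and not taken from any existing kernel, and the
tables are deliberately partial: they carry only a few entries to
show the shape, where real ones would cover all of 0x80..0xFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical charset identifiers, invented for this sketch. */
enum charset { CS_LATIN1, CS_LATIN2, CS_KOI8R };

/* One table entry: an 8-bit code and the Unicode value it stands for. */
struct cs_map { uint8_t byte; uint16_t ucs; };

/* Partial tables for illustration only. */
static const struct cs_map latin2_map[] = {
    { 0xA1, 0x0104 },   /* LATIN CAPITAL LETTER A WITH OGONEK */
    { 0xA3, 0x0141 },   /* LATIN CAPITAL LETTER L WITH STROKE */
    { 0xB1, 0x0105 },   /* LATIN SMALL LETTER A WITH OGONEK */
    { 0xB3, 0x0142 },   /* LATIN SMALL LETTER L WITH STROKE */
};
static const struct cs_map koi8r_map[] = {
    { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A */
    { 0xE1, 0x0410 },   /* CYRILLIC CAPITAL LETTER A */
};

#define UCS_UNKNOWN 0xFFFD  /* Unicode REPLACEMENT CHARACTER */

/* Pick the table for a charset; Latin-1 needs none. */
static const struct cs_map *cs_table(enum charset cs, size_t *n)
{
    switch (cs) {
    case CS_LATIN2:
        *n = sizeof latin2_map / sizeof latin2_map[0];
        return latin2_map;
    case CS_KOI8R:
        *n = sizeof koi8r_map / sizeof koi8r_map[0];
        return koi8r_map;
    default:
        *n = 0;
        return NULL;
    }
}

/* 8-bit code in the given charset -> 16-bit Unicode value. */
uint16_t cs_to_ucs(enum charset cs, uint8_t c)
{
    size_t n, i;
    const struct cs_map *map = cs_table(cs, &n);

    if (c < 0x80 || cs == CS_LATIN1)
        return c;       /* ASCII is shared; Latin-1 is U+0000..U+00FF */
    for (i = 0; i < n; i++)
        if (map[i].byte == c)
            return map[i].ucs;
    return UCS_UNKNOWN; /* not in this partial table */
}

/* 16-bit Unicode value -> 8-bit code, or fallback if unrepresentable. */
uint8_t ucs_to_cs(enum charset cs, uint16_t u, uint8_t fallback)
{
    size_t n, i;
    const struct cs_map *map = cs_table(cs, &n);

    if (u < 0x80 || (cs == CS_LATIN1 && u < 0x100))
        return (uint8_t)u;
    for (i = 0; i < n; i++)
        if (map[i].ucs == u)
            return map[i].byte;
    return fallback;
}
```

Reading a foreign floppy then amounts to cs_to_ucs() with the
floppy's charset followed by ucs_to_cs() with the machine's charset,
with a fallback such as '?' for anything that has no equivalent.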

What about large character sets? You must use Unicode. For all the
normal applications, the normal system calls _must_ remain _pure_
8-bit for the next 30 years. Sorry, UTF-8 and BIG5 both fail: they
are multi-byte encodings, so neither one is pure 8-bit.

This problem can be fixed the same way other system call problems
get fixed: add a second set of system calls or a personality.
Only true 16-bit Unicode can work right, and it is not at all
compatible with the existing API.

New programs can use a CHAR typedef to support both 8-bit and 16-bit
systems. Java would only need fixes in the interpreter. I'd guess
there are C++ classes that can handle the matter. For all the old
software, mangling is appropriate.
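
A rough sketch of the CHAR typedef idea. UNICODE16 is a hypothetical
build-time switch and the helper names are made up for illustration;
the sketch also assumes unsigned short is 16 bits wide, which is
common but not guaranteed by C.

```c
#include <stddef.h>

/* UNICODE16 is a hypothetical switch, not an existing macro. */
#ifdef UNICODE16
typedef unsigned short CHAR;    /* assumed to be 16 bits wide */
#else
typedef char CHAR;              /* classic 8-bit build */
#endif

/* Length in characters, whatever the width of CHAR. */
size_t char_len(const CHAR *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Pointer to the last path component (the part after the final '/'). */
const CHAR *char_basename(const CHAR *path)
{
    const CHAR *base = path;

    for (; *path != 0; path++)
        if (*path == (CHAR)'/')
            base = path + 1;
    return base;
}
```

Code written this way never assumes that a character is one byte, so
the same source compiles for either kind of system.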

BTW, about NT: Of course you can't put Chinese into the standard
US English version!!! To reduce overhead, the US version is
stripped down a bit. Note that the kernel itself is still Unicode.
You just don't get the DLLs and fonts to do fancy text drawing.
Maybe the apps were compiled to run as 8-bit only. It is not a
kernel issue if the apps work that way. What is a kernel issue is
the way kernel calls work.

Sun and Microsoft both use the 2-byte encoding. It is best.
You can send that directly into the kernel, but not via the old
system calls. You need an open(2) that uses a 16-bit '/' for
the path and a 16-bit '\0' for the end of a string.
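
To illustrate, here is a small sketch of how such a call might scan
its path argument, next to what a byte-oriented call sees when handed
the same memory. The function names are hypothetical; no such system
call interface is being quoted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count the components of a 16-bit path. The separator is a 16-bit
 * '/' (0x002F) and the string ends at a 16-bit '\0' (0x0000).
 */
size_t path16_components(const uint16_t *path)
{
    size_t count = 0;
    int in_name = 0;

    for (; *path != 0x0000; path++) {
        if (*path == 0x002F) {
            in_name = 0;
        } else if (!in_name) {
            in_name = 1;
            count++;
        }
    }
    return count;
}

/*
 * What an old 8-bit call would make of the same buffer: strlen()
 * stops at the first zero byte, and every ASCII character in a
 * 16-bit string carries one.
 */
size_t bytes_seen_by_8bit_api(const uint16_t *path)
{
    return strlen((const char *)path);
}
```

For a path such as { 0x002F, 0x0075, 0x0073, 0x0072, 0x002F, 0x4E2D,
0x0000 } the 16-bit scan finds two components, while the 8-bit view
ends after at most one byte, depending on byte order.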