Unicode, etc. solution

Albert D. Cahalan (acahalan@cs.uml.edu)
Wed, 27 Aug 1997 04:29:45 -0400


Russian users (with KOI-8 I guess) can do very well in a
world that assumes Unicode & Latin-1 in the kernel.
They only lose (somewhat) when the kernel must deal with
filesystems that are _not_ in the native characterset.
Load native console data, avoid NTFS, and be happy.

The reason is simple: If the kernel assumes you use Latin-1
for both the filesystem and console, it won't mangle anything.
This is not exactly the same as ignoring the characterset!
It is a nice hack for 1-language 8-bit systems.
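
A small user-space sketch of why the Latin-1 assumption is harmless. This is Python rather than kernel code, and the filename is made up; it only shows that a byte-to-code-point mapping with identical values can not change any byte.

```python
# Filename as a KOI8-R user would store it on disk (raw 8-bit bytes).
name = "привет.txt"
on_disk = name.encode("koi8_r")

# A kernel that assumes Latin-1 maps each byte to the code point
# with the same numeric value...
as_kernel_sees_it = on_disk.decode("latin-1")

# ...and maps it straight back on the way out, so no byte changes.
back_out = as_kernel_sees_it.encode("latin-1")
round_trip_ok = back_out == on_disk

# The text only looks right once the console has KOI8-R data loaded.
shown_on_koi8_console = back_out.decode("koi8_r")
```

Inside the "kernel" the string is nonsense Latin-1, but nothing is lost, which is the whole point of the hack.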

UTF-8 is not quite as bad as many native multi-byte encodings,
but it is still really bad. It can be used to store Unicode
filenames on a hostile network filesystem. It is also a weak
kind of compression for systems that are 95% ASCII and 5%
of some mixed/large language(s). Other than that, forget it.
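
To put rough numbers on the "weak kind of compression" remark, here is a sketch comparing UTF-8 against plain 16-bit Unicode. The sample names are invented, and little-endian order is an arbitrary choice for the illustration.

```python
def sizes(text):
    """Return (UTF-8 byte count, 16-bit Unicode byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

ascii_name = "README.txt"      # 10 ASCII characters
cyrillic_name = "привет"       # 6 Cyrillic letters
cjk_name = "中文文件"           # 4 CJK characters

ascii_sizes = sizes(ascii_name)        # (10, 20): UTF-8 is half the size
cyrillic_sizes = sizes(cyrillic_name)  # (12, 12): no gain either way
cjk_sizes = sizes(cjk_name)            # (12, 8):  UTF-8 is larger
```

UTF-8 only wins while the text stays mostly ASCII.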

BTW, a word about NT: Of course you can't put Chinese into
the standard US version!! People would just scream about the
overhead of both processing and disk space. Incompatibilities
make that idea just absurd. The kernel filesystem interface
supports it though, so you can just get new fonts and DLLs.

The kernel gets hit by characterset issues in several places:

    display                   partly done
    keyboard                  done
    local ext2 filesystem     some problems
    nfs, ntfs, hfs, joliet    oops, disaster
    kernel API                some problems
I think a fix starts by merging the _existing_ translation
tables for hfs, fat, joliet/ntfs/fat32, and the VGA console.
We can save some space by admitting that the kernel _does_
need generic translation functions. Gross, but required.
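
As a sketch of what "generic translation functions" could mean: one pair of routines driven by 256-entry tables, instead of one routine per filesystem. All names here (build_table, to_unicode, from_unicode, TABLES) are hypothetical, and the tables are borrowed from Python's codecs purely for illustration; a kernel would carry its own.

```python
def build_table(codec_name):
    """Build a 256-entry byte -> Unicode code point table from a codec."""
    return [ord(bytes([b]).decode(codec_name)) for b in range(256)]

def to_unicode(raw, table):
    """Translate 8-bit bytes to a list of code points via one table."""
    return [table[b] for b in raw]

def from_unicode(code_points, table, fallback=ord("?")):
    """Reverse translation; unmappable code points become the fallback."""
    reverse = {cp: b for b, cp in enumerate(table)}
    return bytes(reverse.get(cp, fallback) for cp in code_points)

# One shared set of tables for every filesystem and the console.
TABLES = {name: build_table(name) for name in ("latin-1", "koi8_r", "cp437")}

koi8_bytes = "привет".encode("koi8_r")
points = to_unicode(koi8_bytes, TABLES["koi8_r"])
back = from_unicode(points, TABLES["koi8_r"])
lossy = from_unicode(points, TABLES["latin-1"])   # no Cyrillic in Latin-1
```

The same two functions serve every table, so only the data differs per character set.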

Next, the kernel must translate filenames. I want to put your
KOI-8 floppy in my system and read it the right way as well
as I can. If I convert to full Unicode, I want to read every
filesystem I can find. This requires a mount option for every
filesystem with a poorly defined character set.

To avoid needing that mount option, ext2 needs a characterset
flag at the very least. 0 is "unknown" of course, which gets
passed right through in the default configuration. It seems that
this flag already exists in OS/2 hpfs.
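
A rough sketch of how such a flag might be consulted. The flag values other than 0, the function name read_name, and the rule that the flag wins over the mount option are all assumptions made for this illustration.

```python
# Hypothetical flag values; only 0 = "unknown" comes from the text above.
CHARSET_UNKNOWN = 0
CHARSET_LATIN1 = 1
CHARSET_KOI8_R = 2

CODECS = {CHARSET_LATIN1: "latin-1", CHARSET_KOI8_R: "koi8_r"}

def read_name(raw, fs_flag, mount_charset=None):
    """Decode an on-disk name: flag first, then mount option, else pass through."""
    codec = CODECS.get(fs_flag)
    if codec is None and mount_charset is not None:
        codec = mount_charset
    if codec is None:
        codec = "latin-1"   # pass-through: byte value == code point
    return raw.decode(codec)

raw_name = "привет".encode("koi8_r")
flagged = read_name(raw_name, CHARSET_KOI8_R)
unknown = read_name(raw_name, CHARSET_UNKNOWN)
mounted = read_name(raw_name, CHARSET_UNKNOWN, mount_charset="koi8_r")
```

With the flag set, no mount option is needed; with 0, the bytes survive untouched unless the admin says otherwise.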

That leaves only the kernel API. The standard way of fixing
an API will do quite well: alternate system calls for raw
16-bit Unicode. Only the calls that take/return 8-bit text
need alternates. The old calls do _not_ get deprecated, at
least not much. They need to use plain 8-bit (not multi-byte)
text and remain that way for the next 30 years at least.
For the new API, pick the byte order with Java and vfat in mind.
The '/' and '\0' are safe: the kernel uses 16-bit versions.
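
To illustrate why '/' and '\0' stay safe, here is a sketch that splits a path on 16-bit units rather than bytes. Little-endian order is an arbitrary pick for the example, and to_units and split_path are hypothetical helpers, not a proposed interface.

```python
import struct

def to_units(text):
    """Encode text as a tuple of little-endian 16-bit code units."""
    raw = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)

def split_path(units):
    """Split a sequence of 16-bit units on the 16-bit '/' (0x002F)."""
    parts, current = [], []
    for u in units:
        if u == 0x002F:
            parts.append(tuple(current))
            current = []
        else:
            current.append(u)
    parts.append(tuple(current))
    return parts

# U+012F contains the byte 0x2F and U+0100 contains the byte 0x00.
path = "dir/\u012f\u0100"
raw = path.encode("utf-16-le")
units = to_units(path)
components = split_path(units)
```

A byte-oriented scan of raw would find a second '/' and several NUL bytes; comparing whole 16-bit units finds exactly one separator and no terminator.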

----

One last issue: It may be nice to allow an old API translation
beyond simple truncation. It is only needed when several
conditions are _all_ met:

1. Unicode truncation is not correct for the native encoding
2. The native encoding fits in 8 bits
(otherwise failure can not be avoided at all)
3. The user can not pretend everything is Latin-1

That would only happen with multiple 8-bit users that disagree
about the characterset and also need each others' filenames.
Otherwise, you can just use odd translation tables on the
filesystems or let each user have a different encoding.
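
A sketch of the difference between simple truncation and a real per-user translation for the old 8-bit calls. Both function names are hypothetical, and "truncation" is taken here to mean keeping the low 8 bits of each code point.

```python
def truncate(text):
    """Old-API fallback: keep only the low 8 bits of each code point."""
    return bytes(ord(c) & 0xFF for c in text)

def translate(text, codec):
    """Per-user translation; characters the codec lacks become '?'."""
    return text.encode(codec, errors="replace")

latin_name = "café"
cyrillic_name = "п"          # U+043F

# Truncation is exactly right when the native encoding is Latin-1...
latin_ok = truncate(latin_name) == latin_name.encode("latin-1")

# ...but wrong for KOI8-R: U+043F & 0xFF is 0x3F, which is ASCII '?'.
truncated = truncate(cyrillic_name)
translated = translate(cyrillic_name, "koi8_r")
```

Only the translate path gives the KOI8-R user the right byte, and it is that path which needs to know the caller's character set.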

That is just an extra thought. It requires that processes or users
get tagged with a characterset. Most people won't need it and
the overhead is not good.