Unicode in kernel

Albert Cahalan (acahalan@lynx.dac.neu.edu)
Thu, 28 Aug 1997 02:25:24 -0400 (EDT)


Was "Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish".
Darin Johnson (darin@connectnet.com) writes:

> So - back to the *kernel*: I would think that even supporters of
> unicode, at this point, can see that it is still a very contentious
> standard, controversial enough that there is a high possibility of the
> standards changing a lot in the future, or the standard being ignored
> by lots of people (and an unused standard isn't really a standard
> anymore). Thus it's too controversial for standardization inside of
> Linux. Leave it to user space libraries and linux distributions.
> Later, if unicode does become widely accepted, then think about adding
> it in the kernel.

Users may ignore Unicode. They are users!
("Don't tell me about high-tech stuff like ASCII.")

Programmers may not ignore it, because it is here to stay.
It is in: vfat, ntfs, new MacOS filesystem, plan 9, Java, smb/cifs,
various web standards, joliet, UDF, NT kernel interface...
With that kind of support, it is not going to just go away.

> (and for heavens sake, if someone does add it to the kernel;
> make it a compile time option!!!)

To the extent that quota support is, yes.

For binary compatibility, the new 16-bit kernel calls will need to exist
in every kernel, however it is configured. People who don't want full
Unicode support could have the kernel manipulate text as 8-bit and
translate whenever an app calls a 16-bit system call. Full Unicode
systems would manipulate text as 16-bit and translate for the old
8-bit system calls.
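
To make the two configurations concrete, here is a rough sketch of the
idea. It is only a user-space model, not kernel code: every class and
function name is made up for illustration, Latin-1 is just an assumed
choice of 8-bit character set, and replacing unmappable characters with
'?' is one possible policy among several.

```python
# Hypothetical sketch: all names are invented for illustration, and
# Latin-1 is only an assumed choice of 8-bit character set.

def to_8bit(units):
    """Translate 16-bit code units to 8-bit text; unmappable ones become '?'."""
    return bytes(u if u < 0x100 else ord("?") for u in units)

def to_16bit(data):
    """Translate 8-bit text to a list of 16-bit code units."""
    return list(data)

class EightBitKernel:
    """Manipulates names as 8-bit and translates in the 16-bit calls."""
    def __init__(self):
        self.names = []
    def create(self, name8):            # old 8-bit call, no translation
        self.names.append(bytes(name8))
    def create16(self, name16):         # new 16-bit call, translated on entry
        self.create(to_8bit(name16))
    def readdir(self):
        return list(self.names)
    def readdir16(self):
        return [to_16bit(n) for n in self.names]

class UnicodeKernel:
    """Manipulates names as 16-bit and translates in the old 8-bit calls."""
    def __init__(self):
        self.names = []
    def create16(self, name16):         # new 16-bit call, no translation
        self.names.append(list(name16))
    def create(self, name8):            # old 8-bit call, translated on entry
        self.create16(to_16bit(name8))
    def readdir16(self):
        return [list(n) for n in self.names]
    def readdir(self):
        return [to_8bit(n) for n in self.names]
```

Either kernel offers both sets of calls, so one binary runs on both.
The difference is only which side pays for translation, and that the
8-bit side cannot represent everything the 16-bit side can.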

That assumes we should have only one libc and that static linking
should work across all systems.

The new system calls are really a requirement. You can't run older
8-bit apps and UTF-8 ones on the same system using the same system
calls. Nobody will make such a difficult transition and put up with
incompatibility in _many_ hidden places. UTF-8 looks like a great
source of subtle bugs. With raw Unicode system calls, older apps
won't break in strange ways. They would fail in more obvious ways if
you tried to feed them raw Unicode, but they use the old system calls
and so never get any.
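
A small illustration of the difference between the subtle and the
obvious failure. This is only a sketch: the function names are made up,
the older app is assumed to use Latin-1, and raw Unicode is assumed to
be laid out as little-endian 16-bit units.

```python
# Hypothetical sketch: function names, the Latin-1 assumption and the
# little-endian 16-bit layout are illustrative choices only.

def old_app_reads(raw):
    """An older 8-bit app: treats every byte as one Latin-1 character."""
    return raw.decode("latin-1")

def utf8_app_reads(raw):
    """A UTF-8 app: returns the text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

def c_string(raw):
    """What a C program sees in a buffer: the bytes up to the first NUL."""
    return raw.split(b"\x00", 1)[0]

name = "caf\u00e9"  # "cafe" with an acute accent on the e

# Subtle: the old app accepts UTF-8 bytes without complaint and shows
# five characters instead of four.
garbled = old_app_reads(name.encode("utf-8"))

# Subtle in the other direction: a name written by the old app is not
# valid UTF-8, so the UTF-8 app cannot decode it.
rejected = utf8_app_reads(name.encode("latin-1"))

# Obvious: raw 16-bit text handed to the old app stops at the first
# NUL byte, leaving a one-letter name.
truncated = c_string(name.encode("utf-16-le"))
```

With shared 8-bit calls, nothing tells either app which encoding it was
handed. With separate 16-bit calls the question never comes up.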