[RFC] Filesystem name storage (Was: A Great Idea (tm) about reimplementing NLS.)

From: Kyle Moffett
Date: Wed Jun 15 2005 - 23:01:20 EST


On Jun 15, 2005, at 21:55:04, Patrick McFarland wrote:
On Wednesday 15 June 2005 05:13 am, Denis Vlasenko wrote:
I do not understand how this is going to look from userspace perspective.
Can you give examples how this will work?

IMHO, he means that the userspace would only see Unicode filenames, and the
userspace could only give Unicode names back to the kernel. The kernel, using
this global NLS layer would translate back and forth, and the userland
wouldn't know about it.

Its basically the only sane way to approach the problem of getting the entire
Linux community to convert to Unicode.

Would the following system for filenames resolve most of the issues people
are raising:

First load charset tables into the kernel. These would be stored in files in
userspace and could be easily updated, renamed, deleted, etc. Such a table
would always be a translation from Unicode <=> Charset. A kernel with this
system built in would understand natively "raw", "utf8", "utf16", and "utf32",
anything else would need loaded charset tables.

The following mount options would available:
nls_raw=(0|1) [default 1]:
This would cause Linux to pass all chars through unmolested. This mode
works well on multiuser systems where users want to use their own NLS
tools, or where the whole system uses UTF-8, including the filesystems.
This is backwards compatible with the way Linux currently presents most
(all?) filesystems. If the options "nls_disk" or "nls_user" are used,
then this option is forced to be zero.
nls_disk=<string-charset>
This specifies the underlying charset which should be used on the disk
or filesystem itself. This may be "negotiate" for any filesystems
which support NLS *and* can identify which charset is in use. Built in
options are "utf8", "utf16", and "utf32". Defaults to "negotiate" if
available otherwise "utf8", but only defaults if "nls_raw" is 0.
nls_user=<string-charset>
This specifies the charset which should be presented to the user. This
may be used to allow a backwards compatibility (IE: A program wants
ISO8859-1, but the admin wants the underlying filesystem to use UTF-8.
Built in options are "utf8", "utf16", and "utf32". Defaults to "utf8"
if "nls_raw" is 0.

The end result is that specifying either nls_disk or nls_user will turn on
automatic NLS conversion, with the unspecified nls_ option being utf8.

If these options are used on bind mounts, they should override the underlying
filesystem's mount options (Instead of stacking). This will allow the admin
to specify:

# mount -t ext3 -o nls_disk=utf8,nls_user=utf8 /dev/hdb /mnt
# mount --bind -o nls_disk=utf8,nls_user=iso8850-1 /mnt/mail /var/ spool/mail

if he/she wants to provide backwards compatibility with a legacy mail
spooling program. Note: A part of each translation table would be an
entry for "Unspecified character", such that any UTF-8 character not mapped
in the table could be translated to a sane default, such as '?'. If names
collide under such translation, the kernel would need a way to keep track of
the collisions (Appended numbers?) and properly re-resolve them when asked.

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r !y?(-)
------END GEEK CODE BLOCK------

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/