Re: unicode

Theodore Y. Ts'o (tytso@MIT.EDU)
Sat, 16 May 1998 01:01:55 -0400


Date: Sat, 16 May 1998 01:19:13 +0200 (MET DST)
From: dwguest@win.tue.nl (Guest section DW)

tytso@mit.edu writes:

> And for ext2, the default filename encoding *will* be UTF-8.

Ah, very good! Thank you!

(Does the ext2 filesystem depend in any way on what character set is
used? No. Then what meaning could a statement like `ext2 uses UTF-8'
have? Clearly, it pleases you to say this, and as long as these
words do not have any effect on kernel code or e2fs utilities, nobody
cares what you say the character set is. I was afraid that you might
attach a real meaning to these words, but now that you say `default',
it just means that any byte sequence that is not a valid UTF-8 string
only occurs on non-default ext2 systems.

What I mean by default is that at some point we might add support for a
single bit in the directory entry to indicate "this name was encoded
using an old 'just send 8 bits' system", to ease the transition for
folks who are just using their local character set.
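
To make the idea concrete, here is a minimal sketch of how a reader of
directory entries might honor such a bit. Everything named here is an
assumption for illustration: ext2 defines no such flag, LEGACY_8BIT_NAME
and decode_dirent_name are hypothetical, and Latin-1 merely stands in
for "whatever local character set the user had".

```python
# Hypothetical flag bit; ext2 defines no such bit in its directory entries.
LEGACY_8BIT_NAME = 0x01


def decode_dirent_name(raw, flags, legacy_charset="latin-1"):
    """Decode a directory-entry name according to the (hypothetical) flag."""
    if flags & LEGACY_8BIT_NAME:
        # Old "just send 8 bits" name: interpret it in the local character set.
        return raw.decode(legacy_charset)
    # Default case: the name must be well-formed UTF-8, or decoding fails.
    return raw.decode("utf-8")
```

With the flag clear, the bytes 63 61 66 C3 A9 decode as UTF-8; with the
flag set, the bytes 63 61 66 E9 decode as Latin-1 to the same name. The
second byte sequence with the flag clear is rejected, since E9 is not a
complete UTF-8 sequence on its own.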

However, by default, it is fair game for any future kernel extension for
handling internationalization (probably using a UCS-2 or UCS-4
interface) to assume that ext2 filenames are encoded using UTF-8.
Similarly, that same kernel extension can assume that NTFS uses an
on-disk format of UCS-2, because that is what Microsoft defined for
their filesystem.

It's already the case that the NTFS code in the Linux kernel translates
the UCS-2 encoding found in the NTFS directory entries into ASCII, so
that the right thing happens when you type "ls" in an NTFS directory.
The fact of the matter is that this kind of translation has to be done
in the kernel. The assumption that filenames are "just octet strings"
and that their encoding is none of the business of the filesystem driver
is simply a fallacy. If the NTFS driver did not do this translation,
existing user tools which assume ASCII would have broken, and broken badly.
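
As a rough illustration of that kind of translation (this is not the
NTFS driver's actual code; the function name and the choice of '?' for
untranslatable characters are assumptions), a UCS-2 name can be reduced
to ASCII like this:

```python
def ucs2_name_to_ascii(raw):
    """Translate a little-endian UCS-2 name to ASCII.

    Characters outside ASCII become '?'. Python's "utf-16-le" codec is
    used as a stand-in for UCS-2; the two agree for names that contain
    no surrogate pairs.
    """
    name = raw.decode("utf-16-le")
    return name.encode("ascii", errors="replace").decode("ascii")
```

The reason untranslated names would break ASCII tools is visible in the
raw bytes: in UCS-2 the name "ls" is stored as 6C 00 73 00, so any
program that treats the name as a NUL-terminated string would stop
after the first character.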

No, the current Linux NTFS code doesn't translate other character sets,
but that's because we don't have a set of VFS interfaces which handle
internationalization. This has to be done in the kernel; it can't be
done in user code, because different filesystems define different ways
of encoding different character sets. So the kernel is going to need a
set of VFS I18N interfaces, probably using either UCS-2 or UCS-4 as the
interface.
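
A minimal sketch of what such an interface could look like, with UCS-4
code points as the common form. The table and both function names are
hypothetical, not an existing kernel interface; the only facts taken
from this message are that ext2 defaults to UTF-8 and NTFS stores
16-bit units on disk.

```python
from typing import List

# Per-filesystem on-disk name encodings (hypothetical lookup table).
ON_DISK_ENCODING = {
    "ext2": "utf-8",      # default assumed for ext2
    "ntfs": "utf-16-le",  # NTFS stores little-endian 16-bit units
}


def name_to_ucs4(fs_type: str, raw: bytes) -> List[int]:
    """Convert an on-disk name to a list of UCS-4 code points."""
    return [ord(ch) for ch in raw.decode(ON_DISK_ENCODING[fs_type])]


def name_from_ucs4(fs_type: str, code_points: List[int]) -> bytes:
    """Convert UCS-4 code points to the on-disk form used by fs_type."""
    name = "".join(chr(cp) for cp in code_points)
    return name.encode(ON_DISK_ENCODING[fs_type])
```

With this shape, copying a name from NTFS to ext2 is
name_from_ucs4("ext2", name_to_ucs4("ntfs", raw)), and the caller never
needs to know how either filesystem encodes names on disk.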

- Ted
