Re: Unicode details (no war!), the kernel, and filenames

James Mastros (root@jennifer-unix.dyn.ml.org)
Wed, 27 Aug 1997 21:14:51 -0400 (EDT)


On Wed, 27 Aug 1997, Teunis Peters wrote:
> [clipped by James Mastros <root@jennifer-unix.dyn.ml.org> on interface to
> translation module - could work]
BTW - This is actually a response to something I wrote.
> >
> > This sounds somewhat OK. Unicode is fine for internal formats.
> > But I don't think more problems are solved than are introduced.
> >
> > One ugly part is that for a file, the contents are the responsibility
> > of user space, and the file name is the responsibility of kernel space.
>
> Umm - how difficult would it be for userspace to handle filenames?
> [this may seem kinda strange until you look at a filesystem as yet another
> database system... it'd be kinda fun to name a file the colour blue :]
True, but it breaks things all over the place... The only way I can think of
to do this is to have lookup functions return the filename as stored, and
let userspace sort through all the details itself. But this creates problems:

1) The raw filenames might have 0x00 or 0x2f (null and '/') as one byte of a
wide character, thereby screwing stuff up.
2) The kernel gets mount options, userspace doesn't.
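
To make point 1 concrete, here is a small sketch. It assumes "raw" means
2-byte big-endian code units (Python's "utf-16-be" codec stands in for that
here), and the helper name troublesome_bytes is made up for illustration.

```python
# Illustration only. Assumption: "raw Unicode" = 2-byte big-endian code
# units, which Python's "utf-16-be" codec produces for these characters.

def troublesome_bytes(name, encoding):
    """Return the bytes of the encoded name that a byte-oriented
    path parser treats specially: 0x00 (terminator) and 0x2F ('/')."""
    return {b for b in name.encode(encoding) if b in (0x00, 0x2F)}

i_ogonek = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK, code point U+012F

# The low byte of U+012F is 0x2F, the same value as '/'.
print(i_ogonek.encode("utf-16-be").hex())
print(troublesome_bytes(i_ogonek, "utf-16-be"))

# Even plain ASCII letters carry a 0x00 high byte in this form.
print(troublesome_bytes("A", "utf-16-be"))

# UTF-8 encodes the same character without either byte.
print(troublesome_bytes(i_ogonek, "utf-8"))
```

So a lookup routine that splits on 0x2f or stops at 0x00 would cut such a
name in the middle of a character.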

> > Another ugly part is, you don't know what encoding most FS's actually
> > use. That is, if you've got a file name on ext2fs, how do you know
> > how to convert it to UTF-8? Or an imported ufs disk? What if ext2fs
> > has some files in one encoding, and others in a different one?
>
> Hmm.... Standards are good [pity there's so many].. Though by and large
> it's standardized on UTF-8.... supposedly... (if anyone bothers paying
> attention)
The only place I've ever heard that is this thread. So, since a standard
that isn't known is no standard, it's really closer to a "whatever you put
there" standard. It's just that "most people" (please don't argue whether
the greater number of people use non-ASCII characters in filenames) use
UTF-8 without realising it, since UTF-8 agrees with their "local"
charset over the commonly used range, anyway. (Accented characters are
probably different, but A-Za-z_\-\. are in the same place in many (most?)
charsets, including UTF-8.)
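
A quick way to see this, as a sketch only: encode the same name in a few
charsets and compare the bytes. The helper same_bytes and the choice of
Latin-1 and cp1252 as the "local" charsets are assumptions for illustration.

```python
# Illustration only: compare how one name is stored under several charsets.

def same_bytes(name, encodings=("latin-1", "cp1252", "utf-8")):
    """True if the name encodes to identical bytes in every listed charset."""
    return len({name.encode(enc) for enc in encodings}) == 1

plain = "README_v1-2.txt"   # only A-Za-z, digits, '_', '-', '.'
accented = "caf\u00e9"      # "cafe" with an acute accent on the e

print(same_bytes(plain))     # ASCII-range names are stored identically
print(same_bytes(accented))  # the accented letter is where they diverge

print(accented.encode("latin-1").hex())  # one byte for the accented e
print(accented.encode("utf-8").hex())    # two bytes for the accented e
```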

But not raw unicode. The advantages to raw unicode are:
1) It always has 16 bits per character. Never less, never more.
2) It screws everybody over equally (more or less).
3) Stuff doesn't work half-way. Either it works or it doesn't. That way, we
don't fail to notice that something crashes on characters > 0xFF simply because
nobody happened to use that function with such characters before.
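
Point 1 can be checked with a few lines. This is a sketch under the
assumption that "raw unicode" means fixed 2-byte code units and that the
characters in question fit in 16 bits; the helper name widths is made up.

```python
# Illustration only: bytes used per character under each encoding.
# Assumption: all characters here fit in one 16-bit code unit.

def widths(text, encoding):
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

sample = "a\u00e9\u20ac"  # 'a', e with acute accent, euro sign

print(widths(sample, "utf-8"))      # varies from character to character
print(widths(sample, "utf-16-be"))  # the same for every character here
```

With the variable-width form, code that only ever saw one-byte characters
can look correct for a long time, which is the half-working case in point 3.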

> > For NTFS, yes it makes sense to convert to UTF-8 and pass that on;
> > because we know exactly what encoding it always uses, and we need to
> > handle this in kernel space (so we don't have null chars in
> > filenames). Yes, you've just solved NTFS's problem; but that could
> > have been done solely inside of the NTFS handler.
>
> And VFAT and CD-filesystems (DVD, joliet <grr>) and SMB... any other
> takers?
> Hey - this might even solve some of the translation problems with HFS
> (Macintosh - ':' is invalid and '/' is acceptable in a filename)

Not really. ':' and '/' are two different characters; you wouldn't have
them map to each other in the unicode<->(whatever charset HFS uses) tables.

> I don't know who plans on using DVD disks but if there's going to be
> support UTF-8 is mandatory (or was that UCD? I thought it was UTF-8.
> Much same, slightly different encoding)... (either that or Linux joins
> Mickysloth in inventing new standards).

It's Unicode, but not with a UTF-8 encoding. That's why I said function
callbacks for the generic charset translation, not tables. That way
unicode-related charsets (I consider UTF-8, raw unicode, and that DVD thing
separate (but closely related) charsets) can do the function thing, whereas
others can have a function that calls a generic table-based function.
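
Roughly what I mean, as a userspace sketch rather than kernel code. Every
name here (CONVERTERS, register, make_table_decoder, to_unicode, the
"ucs-2be" label) is hypothetical, and the Latin-1 table is just one example
of a table-driven charset.

```python
# Illustration only: per-charset callbacks, with a generic table-based
# callback for simple single-byte charsets. All names are hypothetical.

CONVERTERS = {}

def register(charset, decode):
    """Associate a charset name with a bytes -> str callback."""
    CONVERTERS[charset] = decode

def make_table_decoder(table):
    """Build a callback for a single-byte charset from a 256-entry table."""
    def decode(raw):
        return "".join(table[b] for b in raw)
    return decode

# Unicode-related encodings supply their own functions.
register("utf-8", lambda raw: raw.decode("utf-8"))
register("ucs-2be", lambda raw: raw.decode("utf-16-be"))

# Everything else reuses the generic table-based function.
# In Latin-1, byte n maps to code point n.
register("latin-1", make_table_decoder([chr(i) for i in range(256)]))

def to_unicode(charset, raw):
    """Translate a stored filename using the callback for its charset."""
    return CONVERTERS[charset](raw)

print(to_unicode("latin-1", b"caf\xe9") == to_unicode("utf-8", b"caf\xc3\xa9"))
```

The caller never needs to know whether a charset is handled by a table or by
a hand-written function; it only looks up the callback.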

>
> G'day, eh?
> - Teunis
>
> PS : to reiterate, how difficult WOULD it be to make filenames completely
> a userspace issue?
I think it would be /real/ tough. OK, most of the work is already done
(user-fs project). But it would break stuff, and user-fs really isn't any
better than what we have now wrt translation. User-fs still has one central
authority that handles filenames and doesn't know what charset the
user-program wants. And, IMnsHO, that is what really matters. To
re-iterate my basic tenets of charset translation (which, btw, aren't
quite the same as when I started on this track...)

1. Each piece of the system should be able to use any charset without
influencing or being influenced by any other part (each program, and each
filesystem).
2. The default should satisfy the most people possible (this is probably a
simple 8-bit clean interface).
3. The least number of people should be dissatisfied. (This is probably
Unicode. Sorry Alex. If you want BIG-5 or something, go ahead, and see what
I said in number two. I see you bashing Unicode, but I see nothing better.)

I got this as my fortune, I think it pertains here:

No extensible language will be universal.
-- T. Cheatham

If you want to flame me without any better idea, my address is
root@jennifer-unix.dyn.ml.org. If you have any other viable ideas, feel
free to mail the list.

-=- James Mastros