Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)

From: John Bradford
Date: Thu Feb 12 2004 - 13:59:23 EST


> > I'm not sure whether it's valid UTF-8 or not, but it's certainly
> > possible to code, for example, an 'A', (decimal 65), via an escape to
> > a 31-bit character representation. Presumably the majority of UTF-8
> > parsers would decode the sequence as 65, rather than emit an error.
>
> There are many ways of getting things wrong. The algorithm for encoding
> UTF-8 doesn't give you the option of encoding 65 as two bytes; any UCS-4
> character with code 0-0x7F must result in a onand the same principle goes
> for every other character and the unicdeo standard forbids the use of anything
> but the shortest possible sequence.

The recommended encoding algorithm forbids anything but the shortest
sequence, yes, but what will the majority of decoders do? I suspect
that at least some will follow the usual networking rule of be liberal
in what you accept, which for filenames may well cause all sorts of
security holes.

> > Also, even ignoring that, how do you handle things like accented
> > characters which can be represented as single characters, or as
> > sequences containing combining characters? Some applications might
> > convert the sequence containing combining characters in to the single
> > character, and others might not.
>
> In UTF-8 you cannot represent à as `a. I can have both in a file name and they
> are different. An application that assumes `a is the same a à (in UTF-8) is broken
> and should be fixed.

Well, as long as every userspace implementation gets it correct, we'll
be OK. Personally, I doubt they all will, especially those that
convert from legacy encodings to Unicode, although quite possibly the
above scenario with combining characters is not likely to happen for
filenames. Or is it? What about copying a file from a filesystem
with a UTF-8 encoding to a filesystem with a legacy encoding, and then
back again?

However, I am less concerned about this second scenario than the first.

John.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/