Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)

From: Robin Rosenberg
Date: Thu Feb 12 2004 - 13:08:42 EST


On Thursday 12 February 2004 18.16, John Bradford wrote:
> I'm not sure whether it's valid UTF-8 or not, but it's certainly
> possible to code, for example, an 'A', (decimal 65), via an escape to
> a 31-bit character representation. Presumably the majority of UTF-8
> parsers would decode the sequence as 65, rather than emit an error.

There are many ways of getting things wrong. The algorithm for encoding
UTF-8 doesn't give you the option of encoding 65 as two bytes; any UCS-4
character with code 0-0x7F must result in a onand the same principle goes
for every other character and the unicdeo standard forbids the use of anything
but the shortest possible sequence.

> Also, even ignoring that, how do you handle things like accented
> characters which can be represented as single characters, or as
> sequences containing combining characters? Some applications might
> convert the sequence containing combining characters in to the single
> character, and others might not.

In UTF-8 you cannot represent à as `a. I can have both in a file name and they
are different. An application that assumes `a is the same a à (in UTF-8) is broken
and should be fixed.

-- robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/