Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: Linus Torvalds
Date: Tue Feb 17 2004 - 16:06:52 EST




On Tue, 17 Feb 2004, John Bradford wrote:

> > Ok, but... why? What does 32-bit words get you that UTF-8 does not?
> > I can't think of a single advantage, just lots of disadvantages.
>
> The advantage is that you can use them to store UCS-4.

Wrong. UTF-8 can store UCS-4 characters just fine.

Admittedly you might need up to six octets for the worst case, but hey,
since you only need one for the most common case (by _far_), who cares?
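
As a rough illustration of the octet counts above, here is a minimal sketch of the original one-to-six-octet UTF-8 scheme (the one described in RFC 2279), which covers the full 31-bit UCS-4 range. The function name utf8_encode_original is hypothetical and is not part of any library. Note that the later RFC 3629 restricts UTF-8 to four octets and U+10FFFF, so the five- and six-octet forms are shown only to match the claim in this mail.

```python
def utf8_encode_original(code_point: int) -> bytes:
    """Encode one code point with the original 1-to-6 octet UTF-8 scheme.

    Hypothetical helper for illustration; it does not apply the
    RFC 3629 restrictions (4 octets max, no surrogates).
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if code_point < 0x80:
        # The most common case: plain ASCII is a single octet.
        return bytes([code_point])
    # (exclusive upper bound, total octets, lead-octet marker bits)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if code_point < limit:
            break
    out = bytearray(length)
    # Fill continuation octets from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (code_point & 0x3F)
        code_point >>= 6
    # Whatever bits remain go into the lead octet.
    out[0] = lead | code_point
    return bytes(out)


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10400, 0x7FFFFFFF):
        encoded = utf8_encode_original(cp)
        print(f"U+{cp:04X}: {len(encoded)} octet(s), {encoded.hex()}")
```

For code points that Python's own encoder accepts, the result should match str.encode("utf-8"); only the largest 31-bit values need all six octets.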

And with the same UTF-8 encoding, you could some day encode UCS-8 too if
the idiotic standards bodies some day decide that 4 billion characters
isn't enough because of all the in-fighting.

> Now, for file _contents_ this would be a compatibility disaster, which
> is why UTF-8 is so convenient, but for file_names_ UCS-4 lets you
> unambiguously represent any string of Unicode characters.

Why do you think UTF-8 can't do this? Did you read some middle-aged text
written by monks in a monastery that said that UTF-8 encodes a 16-bit
character set?

> Basically - no more multiple representations of the same thing. No more
> funny corner cases where several different strings of bytes eventually
> resolve to the same name being presented to the user.

Welcome to normalized UTF-8. And realize that the "non-normalized" broken
stuff is what allows us backwards compatibility.
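
To make the "several byte strings, one visible name" case concrete, here is a small sketch using Python's standard unicodedata module. The choice of NFC as the normalization form and the helper name normalize_name are assumptions made for illustration; the mail does not say which form is meant, and nothing here reflects what any filesystem actually does.

```python
import unicodedata


def normalize_name(raw: bytes) -> bytes:
    """Return the NFC-normalized UTF-8 form of a UTF-8 encoded name.

    Hypothetical helper; NFC is an assumed choice of normal form.
    """
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Two different byte strings that a user would see as the same name.
composed = "\u00e9".encode("utf-8")     # precomposed e-acute
decomposed = "e\u0301".encode("utf-8")  # 'e' plus combining acute accent

if __name__ == "__main__":
    print(composed != decomposed)                                  # raw bytes differ
    print(normalize_name(composed) == normalize_name(decomposed))  # normalized bytes agree
```

Byte strings that are left non-normalized still pass through unchanged when nobody calls the helper, which is the backwards-compatibility side of the trade-off.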

Of course, since you like UCS-4, you don't care about backwards
compatibility.

Linus