Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:JFS default behavior)

From: Linus Torvalds
Date: Mon Feb 16 2004 - 17:42:46 EST




On Mon, 16 Feb 2004, Jamie Lokier wrote:
>
> Alas, once userspace has migrated to doing everything in UTF-8, you
> won't be able to read those files because userspace will barf on them.

Nope. Read my other email. Done right, user space will _not_ barf on them,
because it won't try to "normalize" any UTF-8 strings. If the string has
garbage in it, user space should just pass the garbage through.

We've had this _exact_ issue before. Long before people worried about
UTF-8, people worried about the fact that programs like "ls" shouldn't
print out the extended ASCII characters as-is, because that would cause
bad problems on a terminal as they'd be seen as terminal control
characters.

Does that mean that unix tools like "rm" cannot remove those files? Hell
no! It just means that when you do "rm -i *", the filename that is printed
may not have special characters in it that you don't see.

Same goes for UTF-8. A "broken" UTF-8 string (ie something that isn't
really UTF-8 at all, but just extended ASCII) won't _print_ right, but
that doesn't mean that the tools won't work. You'll still be able to edit
the file.

Try it with a regular C locale. Do a simple

echo > åäö

(that's latin1), and do a "rm -i åäö", and see what it says.

Right: it does the _right_ thing, and it prints out:

torvalds@home:~> rm -i åäö
rm: remove regular file `\345\344\366'?

In other words, you have a program that doesn't understand a couple of the
characters (because they don't make sense in its "locale"), but it still
_works_. It just can't print them.

Which, if you think about is, is 100% EXACTLY equivalent to what a UTF-8
program should do when it sees broken UTF-8. It can still access the file,
it can still do everything else with it, but it can't print out the
filename, and it should use some kind of escape sequence to show that
fact.

The two cases are 100% equivalent. We've gone through this before. There
is a bit of pain involved, but it's not something new, or something
fundamentally impossible. It's very straightforward indeed.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/