Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:JFS default behavior)

From: Linus Torvalds
Date: Mon Feb 16 2004 - 17:58:27 EST




On Mon, 16 Feb 2004, Linus Torvalds wrote:
>
> Which, if you think about is, is 100% EXACTLY equivalent to what a UTF-8
> program should do when it sees broken UTF-8. It can still access the file,
> it can still do everything else with it, but it can't print out the
> filename, and it should use some kind of escape sequence to show that
> fact.

Side note: a UTF-8 program needs to do escape handling _anyway_, because
even if the filename is 100% UTF-8 compliant, you still can't print out
all the characters as such. In particular, charcters like '\n' etc are
obviously perfectly fine UTF-8, yet they need to be escaped when printing
out filenames in a file selector.

So I claim (and yes, people are free to disagree with me) that a
well-written UTF-8 program won't even have any real extra code to handle
the "broken UTF-8" code. It's just another set of bytes that needs
escaping, and they need escaping for _exactly_ the same reason some
regular utf-8 characters need escaping: because they can't be printed.

So it's all the same thing - it's just the reasons for "unprintability"
that are slightly different.

Now, I'll agree that getting the escaping right (whether for things like
'\n' or for byte sequences that are invalid UTF-8) can be painful. I just
don't think that the pain is in any way specific for "invalid UTF-8". It's
just _hard_ to think of all the special cases, and most programs have bugs
because somebody forgot something.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/