Re: UTF-8 practically vs. theoretically in the VFS API

From: Linus Torvalds
Date: Mon Feb 16 2004 - 14:53:11 EST




On Mon, 16 Feb 2004, John Bradford wrote:
>
> The real problem is with mis-configured userspaces, where buggy UTF-8
> decoders are trying to make sense of data in legacy encodings
> containing essentially random bytes > 127, which are not part of valid
> UTF-8 sequences.
>
> None of this is a real problem, if everything is set up correctly and
> bug free. Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

The way to handle that is to aim to never _ever_ decode utf-8 unless you
really have to. Always leave the string in utf-8 "raw bytestring" mode as
long as possible, and convert to charater sets only when actually
printing.

If you do that, then at worst you'll show the user a strange name (extra
points for marking it as being errenous), but everything still works. You
can still lookup/delete/whatever the file (internally the program still
works on the raw byte sequence and isn't confused). Basically accept the
fact that UTF-8 strings can contain "garbage", and don't try to fix it up.

And no, I'm not claiming that it's wonderfully clean and that we should
all love it. But it's _practical_, and the ugliness is certainly a lot
less than in the alternatives.

And it largely works today.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/