Re: UTF-8 practically vs. theoretically in the VFS API

From: Linus Torvalds
Date: Mon Feb 16 2004 - 15:30:17 EST




On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@xxxxxxxx> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.
>
> The very idea of "UTF-8 with garbage in it" doesn't make sense.

Sure it does.

You live in a theoretical world where
(a) there is only one standard
(b) people read it
(c) people actually follow it and never have bugs

I've got news for you: none of the above is true.

Which means that IN PRACTICE you will find strings that you think are
UTF-8-encoded, but that don't end up being proper UTF-8.
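
A minimal sketch of what that can look like, assuming Python and made-up
filenames (the helper names is_valid_utf8 and display_name are hypothetical,
chosen only for illustration). It detects byte strings that fail strict UTF-8
decoding, yet keeps comparing and looking them up by their raw bytes, without
trying to repair them:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def display_name(raw: bytes) -> str:
    """Printable form for humans; the raw bytes remain the real name."""
    return raw.decode("utf-8", errors="backslashreplace")


# Hypothetical names, as a filesystem might hand them back.
good = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1 = "caf\u00e9".encode("latin-1")  # b'caf\xe9', not valid UTF-8
truncated = good[:-1]                   # b'caf\xc3', cut mid-sequence

names = [good, latin1, truncated]

# Key everything on the raw byte sequence: nothing is normalized or "fixed",
# so the three names stay distinct and each can still be found again.
lookup = {name: index for index, name in enumerate(names)}

for name in names:
    print(is_valid_utf8(name), display_name(name))
```

The escaping happens only at display time; the bytes that were handed in are
the bytes that get stored, compared, and handed back.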

That's the difference between real world and theory.

And you can either write your programs to be "theoretically correct", or
you can write them to "work".

It's your choice. I know which program I'd prefer to use.

Linus