Re: UTF-8 practically vs. theoretically in the VFS API

From: John Bradford
Date: Mon Feb 16 2004 - 14:40:04 EST


Quote from Jeff Garzik <jgarzik@xxxxxxxxx>:
> Linus Torvalds wrote:
> > In short: filenames are byte streams. Nothing more. They don't even have a
> > "character set". They literally are just a series of bytes.
> >
> > And when I say that you have to talk to the kernel using UTF-8, I'm only
> > claiming that it is the only sane way to encode extended characters in a
> > byte stream. Nothing more.
>
>
> Nod. Maybe it helps Marc to point out the key difference between
> characters and bytes, in UTF8.
>
> In UTF8, the number of characters in a string is less-than-or-equal-to
> the number of bytes in the string.
>
> And the kernel just cares about bytes.
>
> This is the whole benefit to UTF8, right here in this thread. UTF8 was
> designed such that ten-year-old C code using standard C strings would
> function just fine. No need to rip up large swaths of your code just to
> call multi-byte versions of the standard string functions. Most code
> that doesn't deal with locale-specific details like uppercase/lowercase
> Just Works(tm).

The real problem is with mis-configured userspaces, where buggy UTF-8
decoders are trying to make sense of data in legacy encodings
containing essentially random bytes > 127, which are not part of valid
UTF-8 sequences.

None of this is a real problem, if everything is set up correctly and
bug free. Unfortunately the Just Works thing falls apart in the,
(frequent), instances that it's not :-(.

John.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/