Re: UTF-8 practically vs. theoretically in the VFS API

From: Linus Torvalds
Date: Wed Feb 18 2004 - 15:02:23 EST




On Wed, 18 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > Somebody correctly pointed out that you do not need any out-of-band
> > encoding mechanism - the very fact that it's an invalid sequence is in
> > itself a perfectly fine flag. No out-of-band signalling required.
>
> Technically this is almost(*) correct,
>
> (*) - It's fine until you concatenate two malformed strings. Then the
> out-of-band signal is lost if the combination is valid UTF-8.

But that's what you _want_. Having a real out-of-band signal that says
"this stuff is wrong, because it was wrong at some point in the past", and
not allowing concatenation of blocks of utf-8 bytes would be _bad_.

The thing, concatenating two malformed UTF-8 strings is normal behaviour
in a variety of circumstances, all basically having to do with lower
levels now knowing about higer-level concepts.

For example, look at a web-page. Look at how the data comes in: it comes
as a stream of bytes, with blocking rules that have _nothing_ to do with
the content (timing, mtu's, extended TCP headers etc etc). That doesn't
mean that you shouldn't be able to
- work on the partial results and show them to the user as UTF-8
- be able to concatenate new stuff as it comes in.

Having an out-of-band signal for "bad" would literally be a bad idea. If
you get a valid UTF-8 stream as a result of concatenation, you should
consider that to be the correct behaviour, or you should CHECK BEFOREHAND
if you think it is illegal.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/