Re: UTF-8 practically vs. theoretically in the VFS API

From: H. Peter Anvin
Date: Wed Feb 18 2004 - 15:13:30 EST


Linus Torvalds wrote:
>
> But that's what you _want_. Having a real out-of-band signal that says
> "this stuff is wrong, because it was wrong at some point in the past", and
> not allowing concatenation of blocks of utf-8 bytes would be _bad_.
>

Indeed. What it does mean, however, is that you have to think about
concatenation if you perform it in UCS-4 space. For example, a string
that ends in whatever code you have chosen for <BOGUS-C8>, concatenated
with a string that starts with <BOGUS-80>, needs to be converted to a
valid <U+0200>, since the bytes C8 80 are well-formed UTF-8 for that
code point. This is of course not an issue if you do the concatenation
in UTF-8 space and don't do round-trip conversion.
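
To make the C8/80 case concrete, here is a minimal sketch in Python. It
assumes Python's "surrogateescape" error handler as the escape scheme
(each undecodable byte 0xNN becomes the code point U+DCNN); that is just
one possible choice of <BOGUS-xx> codes, not anything the VFS does. The
helper names decode_lossless, encode_lossless and concat_normalized are
made up for this illustration.

```python
def decode_lossless(raw: bytes) -> str:
    # Bytes that are not valid UTF-8 are kept as escape code points
    # (byte 0xNN -> U+DCNN) instead of raising an error.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    # Inverse of decode_lossless: escape code points become raw bytes again.
    return text.encode("utf-8", errors="surrogateescape")


def concat_normalized(a: str, b: str) -> str:
    # Concatenate in code-point space, then round-trip through bytes so
    # that adjacent escapes forming a valid sequence get merged.
    return decode_lossless(encode_lossless(a + b))


left = decode_lossless(b"abc\xc8")    # ends in the escape for byte C8
right = decode_lossless(b"\x80def")   # starts with the escape for byte 80

naive = left + right                  # still holds two escape code points
fixed = concat_normalized(left, right)  # holds U+0200 instead
```

The naive result still encodes back to the same bytes as a plain byte
concatenation would give, so nothing is lost; it is simply not the
string you would get by decoding those bytes directly, which matters as
soon as you compare or hash in code-point space.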

None of this is hard; it just takes thinking about it rather than
automatically doing the obvious thing.

> The thing is, concatenating two malformed UTF-8 strings is normal behaviour
> in a variety of circumstances, all basically having to do with lower
> levels not knowing about higher-level concepts.

Indeed.

-hpa
