Re: [Patch] Support UTF-8 scripts

From: Valdis . Kletnieks
Date: Sun Sep 18 2005 - 17:30:20 EST


On Sun, 18 Sep 2005 21:23:42 +0200, Bodo Eggert said:
> Bernd Petrovitsch <bernd@xxxxxxxxx> wrote:
> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> > somewhere in the middle. Does this make sense in any way?
> > How do I get rid of the marker in the middle transparently?
>
> The Unicode standard defines how to handle them.

For the benefit of those of us who are interested in the problem, but aren't
in the mood to wade through a long standard looking for the answer to a
specific question, can you elaborate?

It isn't as obvious as all that, because of all the nasty corner cases...

> > It is different even if a pure ASCII file is marked as UTF-8.
>
> No pure ASCII file will be marked, since a marked file will be no
> ASCII file.

Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
*front* of the file until it's already written past the point in c.txt where
the BOM has to go.
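
To make the case concrete, here is a minimal sketch in Python. The file
contents are made up for illustration, and cat() is a hypothetical stand-in
for what cat(1) does, namely copy its inputs byte for byte in order:

```python
import codecs

def cat(*chunks: bytes) -> bytes:
    """Hypothetical stand-in for cat(1): plain byte-wise concatenation."""
    return b"".join(chunks)

# a.txt: pure ASCII, no marker (contents are illustrative only).
a = "plain ASCII line\n".encode("ascii")
# b.txt: UTF-8 text with the BOM (EF BB BF) at its front.
b = codecs.BOM_UTF8 + "marked line\n".encode("utf-8")

c = cat(a, b)

# c.txt does not start with a BOM, so it looks unmarked ...
print(c.startswith(codecs.BOM_UTF8))
# ... yet the BOM from b.txt sits in the middle, right after a.txt's bytes.
print(c.find(codecs.BOM_UTF8) == len(a))
```

By the time the BOM from b.txt shows up in the input, every byte of a.txt
has already been written, so the output ends up unmarked at the front and
carries a stray U+FEFF in the middle.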

What does the Unicode standard say to do in this case?
