Re: OT: character encodings (was: Linux 2.6.20-rc4)

From: Russell King
Date: Sun Jan 07 2007 - 10:39:03 EST


On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > Russell King wrote:
> > > Welcome to the mess which the UTF-8 charset creates.
>
> Utter bollocks.

Wrong. The problem is partly caused by the fact that not every program
understands multi-byte character encodings, and by text files carrying
absolutely _no_ information about their character encoding.

When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to. As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded. This leads to
utter confusion.
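
A minimal sketch in Python of the ambiguity described above. The helper
name decode_as and the sample character are assumptions chosen for
illustration only; the point is that nothing in the bytes themselves
says which reading is the intended one.

```python
# Illustrative only: the same character stored under two encodings.
utf8_bytes = "\u00e9".encode("utf-8")         # two bytes: C3 A9
latin1_bytes = "\u00e9".encode("iso-8859-1")  # one byte:  E9


def decode_as(data: bytes, charset: str) -> str:
    """Decode data under an assumed charset (hypothetical helper)."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return "<undecodable>"


if __name__ == "__main__":
    files = (("utf-8 file", utf8_bytes), ("iso-8859-1 file", latin1_bytes))
    for label, data in files:
        for assumed in ("utf-8", "iso-8859-1"):
            print(f"{label} read as {assumed}: {decode_as(data, assumed)!r}")
```

Note that reading as ISO-8859-1 never fails, because every byte value
maps to some character. A wrong guess in that direction produces
garbage silently rather than an error.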

To see what I mean, try the following:

$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1

According to that, the charset of the 'git log' output (which in that
test included Leonard's entry) is iso-8859-1, and on that basis Linus'
mailer was right to include it as ISO-8859-1.

In reality, the output from git log contains an ad-hoc collection of
character sets, making its interpretation under any single character
set incorrect.
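
One rough way to see such a mixture is to test each line on its own
instead of the file as a whole. The following Python sketch does that;
the function name classify_lines and the sample data are made up for
illustration, and "other" only means "not valid UTF-8", not any
particular legacy charset.

```python
def classify_lines(data: bytes) -> dict:
    """Count lines that are pure ASCII, valid UTF-8, or neither."""
    counts = {"ascii": 0, "utf-8": 0, "other": 0}
    for line in data.splitlines():
        try:
            line.decode("ascii")
            counts["ascii"] += 1
            continue
        except UnicodeDecodeError:
            pass
        try:
            line.decode("utf-8")
            counts["utf-8"] += 1
        except UnicodeDecodeError:
            # Could be ISO-8859-1 or anything else; the bytes do not say.
            counts["other"] += 1
    return counts


# Made-up sample: one ASCII line, one UTF-8 line, one ISO-8859-1 line.
sample = (
    b"Author: A. Example\n"
    + "Author: Jos\u00e9\n".encode("utf-8")
    + "Author: Jos\u00e9\n".encode("iso-8859-1")
)

if __name__ == "__main__":
    print(classify_lines(sample))
```

Running it on real output would be something like feeding it the
contents of the file 'o' from above, read in binary mode.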

> > The problem of different character encodings coexisting on the same
> > platform, and the resulting occasional messing-up, far predates Unicode.
> > I distinctly remember one case of being bitten by this myself in 1977
> > when Unicode wasn't even on the horizon yet, and I don't think that was
> > the first time.
>
> Indeed. If you take arbitrary content and send it out to the world
> labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
>
> Far from being the cause of the problem, UTF-8 actually offers the
> chance of a _solution_. Because once the Luddites catch up, it'll
> largely eliminate the need for using the multitude of legacy character
> sets and converting between them -- and the problem of mislabelling will
> fairly much go away.

In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.

I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else. UTF-8 does _not_ solve these interoperability
problems - it only makes the entire situation worse by introducing yet
another different charset. (Yes, it's also true that there are programs
which assume the world uses only some other, different, character set.)

Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8. It isn't.
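
As a sketch of what "looking at the LANG environment variable" could
mean in practice, here is a small Python example. The function name
charset_from_env and the US-ASCII fallback are assumptions for
illustration. It also consults LC_ALL and LC_CTYPE first, since POSIX
gives those precedence over LANG.

```python
import os


def charset_from_env(env=None, default="US-ASCII"):
    """Return the codeset named by the locale environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Locale names look like language[_territory][.codeset][@modifier].
            codeset = value.partition("@")[0].partition(".")[2]
            # Names such as "C" carry no codeset; fall back (an assumption).
            return codeset or default
    return default


if __name__ == "__main__":
    print(charset_from_env())
```

A program written in C would more likely call setlocale() and
nl_langinfo(CODESET) than parse the variable by hand; the parsing is
spelled out here only to show where the information lives.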

elinks is one such program. It now assumes that every display is UTF-8
_only_. That's no better than programs which assume ISO-8859-1 only or
US-ASCII only.

So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled. If you're operating in a mixed charset environment
it's one bloody big pain in the butt.

--
Russell King
Linux kernel maintainer of: 2.6 ARM Linux - http://www.arm.linux.org.uk/