Re: [2.6 patch] UTF-8 fixes in comments

From: Willy Tarreau
Date: Tue Apr 29 2008 - 07:07:10 EST


On Tue, Apr 29, 2008 at 01:42:16PM +0300, Adrian Bunk wrote:
> On Tue, Apr 29, 2008 at 12:09:34PM +0200, Willy Tarreau wrote:
> > On Tue, Apr 29, 2008 at 11:06:05AM +0200, Helge Hafting wrote:
> > > >Well, I accidentally used a freshly installed laptop running Mandriva 2008.
> > > >I was typing in a terminal inside KDE (I don't know the program name, sort
> > > >of an xterm, but with huge borders all around). I made a typo in a word and
> > > >typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> > > >removed more chars than typed. I tried again: I pressed this letter 5 times,
> > > >then backspace 10 times, and removed 5 chars from the prompt. I suspect that
> > > >if I had used some chars with wider encoding (eg 4 bytes), I could have
> > > >removed as many... Clearly those tools are not ready.
> > > >
> > > So don't use that particular tool
> >
> > It was not my machine, and had you been there, you would have heard me call
> > it names !
> >
> > > and/or file a bug with the maintainer. :-)
> >
> > It's too easy to impose crappy designs on end-users and tell them that if
> > that does not work they have to file a bug. There is a minimal set of
> > things that must be tested before shipping. Seeing that the default
> > terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does
> > not properly render it simply makes me sick. This is broken by design and
> > even distros trying to get it working for years still can't cope with it.
> > There must be a reason.
>
> I can reproduce your problem in a plain xterm when setting LANG=en_US
> (most likely the same problem can occur with other non UTF-8 settings).

Possibly they broke it when forcing support for variable-length encodings?

> In this case I'm actually more surprised that the character is displayed
> correctly than that you have to type backspace twice.

It's not that I *had* to type it twice. But I *could* type it twice, and
the first one removed the character, the second one the prompt.
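
One plausible explanation, sketched below, is that the erase handling counts
bytes while the screen counts columns. This is an assumption on my part and
not something verified on that machine; the variable names are illustrative
only.

```python
# Assumption: the terminal draws each "é" in one column, while the input
# buffer holds its two-byte UTF-8 encoding and erases one byte per backspace.
typed = "é" * 5

bytes_in_buffer = len(typed.encode("utf-8"))  # what a byte-wise erase counts
columns_on_screen = len(typed)                # what the user actually sees

# Every backspace that finds a byte to erase also wipes one screen column,
# so the surplus backspaces eat into the prompt.
backspaces_accepted = bytes_in_buffer
prompt_columns_eaten = backspaces_accepted - columns_on_screen

print(bytes_in_buffer, columns_on_screen, prompt_columns_eaten)
```

With 5 letters typed this gives 10 accepted backspaces and 5 prompt columns
erased, which matches what I saw.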

> Any kind of charset mixing is highly problematic (which is also why my
> patch was attached compressed), so if you disable UTF-8 anywhere in a
> modern distribution problems are somehow expected (it could also be a
> bug in Mandriva's default settings, but that would really surprise me).

No, it was not disabled at all. I had to type in a command for a
co-worker who had just done a default install the day before, and I made
a typo which I wanted to fix.

> > Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode.
> > That's as silly as if you had to replace your terminals to read
> > native gzip, and expect them as well as all the tools to work
> > properly!
>
> It's not a compressed encoding, it's a variable-length encoding.
>
> Besides the size advantages one main advantage of UTF-8 is that ASCII is
> valid UTF-8. This means that for the ASCII source code in the kernel it
> doesn't matter whether it's treated as ASCII or UTF-8, and no conversion
> was needed.
>
> You can't get this property with a fixed-size Unicode encoding.
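
For reference, the property described above can be checked with a small
sketch. The sample line and the helper name are made up for illustration;
UTF-32 stands in here for "a fixed-size Unicode encoding".

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if the ASCII and UTF-8 encodings of text are identical."""
    return text.encode("ascii") == text.encode("utf-8")


# Hypothetical line of pure-ASCII source code.
source_line = "static int foo(void) { return 0; }"

ascii_compatible = same_bytes_in_ascii_and_utf8(source_line)

# A fixed-size encoding changes every byte position: 4 bytes per character.
utf32_bytes = source_line.encode("utf-32-le")
fixed_size_differs = utf32_bytes != source_line.encode("ascii")

print(ascii_compatible, len(utf32_bytes) // len(source_line))
```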

I don't agree. If you refuse character-set mixing, there's no problem.
Bit 7 of first char == 1 ? => full text is 32 bit.
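
A rough sketch of that idea, under assumptions that are mine for the sake of
illustration: a one-byte marker 0x80 in front of the text signals the 32-bit
form, and UTF-32 big-endian is used for the 32-bit characters. Neither the
marker value nor the function names come from any existing tool.

```python
MARKER_32BIT = 0x80  # hypothetical marker byte, chosen so that bit 7 is set


def encode_text(text: str) -> bytes:
    """Keep pure-ASCII text as is; otherwise store the whole text as 32-bit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return bytes([MARKER_32BIT]) + text.encode("utf-32-be")


def decode_text(data: bytes) -> str:
    """Inspect bit 7 of the first byte to pick the decoding for the full text."""
    if data and data[0] & 0x80:
        return data[1:].decode("utf-32-be")
    return data.decode("ascii")


plain = encode_text("hello")   # unchanged ASCII bytes
wide = encode_text("é")        # marker byte followed by one 32-bit unit
print(plain, len(wide), decode_text(wide))
```

Pure-ASCII text stays byte-for-byte identical, and there is no mixing inside
a single text: it is either entirely 8-bit or entirely 32-bit.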

Willy
