Re: [2.6 patch] UTF-8 fixes in comments

From: Adrian Bunk
Date: Tue Apr 29 2008 - 07:28:11 EST


On Tue, Apr 29, 2008 at 01:06:38PM +0200, Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 01:42:16PM +0300, Adrian Bunk wrote:
> > On Tue, Apr 29, 2008 at 12:09:34PM +0200, Willy Tarreau wrote:
>...
> > > Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode.
> > > That's as silly as if you had to replace your terminals to read
> > > native gzip, and expect them as well as all the tools to work
> > > properly!
> >
> > It's not a compressed encoding, it's a variable-length encoding.
> >
> > Besides the size advantages one main advantage of UTF-8 is that ASCII is
> > valid UTF-8. This means that for the ASCII source code in the kernel it
> > doesn't matter whether it's treated as ASCII or UTF-8, and no conversion
> > was needed.
> >
> > You can't get this property with a fixed-size Unicode encoding.
>
> I don't agree. If you refuse character-set mixing, there's no problem.
> Bit 7 of first char == 1 ? => full text is 32 bit.

You miss my point.

The point is:
A conversion "ASCII -> UTF-8" is a nop.

This means that when changing the kernel from the half a dozen charsets
used in comments to UTF-8, we only had to change the few characters that
were not already valid UTF-8.
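
As a rough illustration, here is a minimal Python sketch. The helper name
to_utf8 and the sample comments are made up for this example, and
ISO-8859-1 is only assumed as one of the legacy charsets:

```python
def to_utf8(raw: bytes, legacy_charset: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from an assumed legacy charset into UTF-8."""
    return raw.decode(legacy_charset).encode("utf-8")


# Hypothetical pure-ASCII comment: the conversion leaves every byte alone.
ascii_comment = b"/* set up the page tables */"
assert to_utf8(ascii_comment) == ascii_comment

# Hypothetical Latin-1 comment: only the single non-ASCII byte (0xf6)
# is rewritten, as the two-byte UTF-8 sequence 0xc3 0xb6.
latin1_comment = b"/* written by J\xf6rg */"
utf8_comment = to_utf8(latin1_comment)
assert utf8_comment == b"/* written by J\xc3\xb6rg */"
```

Files that are pure ASCII come out byte-for-byte identical, so only the
handful of files containing such characters would show up in a diff.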

Going to something like UTF-32 as you suggest would have involved
converting every single file in the kernel.
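
For comparison, a small Python sketch of what a fixed-width encoding does
to an ASCII file. The one-line "file" is hypothetical, and little-endian
UTF-32 without a byte order mark is just one assumed variant:

```python
# Hypothetical one-line source file containing only ASCII.
ascii_source = b"int main(void)\n"
text = ascii_source.decode("ascii")

as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")  # no BOM, 4 bytes per character

# UTF-8 leaves the ASCII bytes untouched; UTF-32 rewrites all of them.
utf8_unchanged = as_utf8 == ascii_source
utf32_unchanged = as_utf32 == ascii_source
utf32_size_ratio = len(as_utf32) // len(ascii_source)

assert utf8_unchanged
assert not utf32_unchanged
assert utf32_size_ratio == 4
```

Every character grows to four bytes, so even a file without a single
non-ASCII character would have to be rewritten.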

> Willy

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
