Re: OT: character encodings

From: Adrian Bunk
Date: Sun Jan 07 2007 - 21:00:16 EST


On Mon, Jan 08, 2007 at 02:32:42AM +0100, Tilman Schmidt wrote:
> Am 08.01.2007 01:38 schrieb Willy Tarreau:
>...
> > And I'm not even
> > discussing the stupidity which requires that you read a whole text to get
> > its number of characters !
>
> Personally I find the requirement to know the number of characters in a text
> rather unusual, so I wouldn't base a decision for an encoding on that. In
> fact, I cannot remember ever really wanting to know the actual number of
> characters in a text. The number of bytes occupied on storage, ok. The
> number of letters, of words, of lines, perhaps even the number of printable
> characters, all potentially interesting, depending on the application.
> But the raw number of characters? I don't know what that might serve for.

Also note that the UTF-32 Unicode encoding would offer this property,
but with the following disadvantages compared to the UTF-8 Unicode
encoding:
- 7bit ASCII is not a subset of UTF-32 losing a lot of compatibility
(code 7bit ASCII with some UTF-8 in the comments is no problem
for not-Unicode aware systems except for slight misdisplayments
of the comments)
- UTF-32 has up to 4 times the size of UTF-8

There's also the point that you can use e.g. "wc" or your editor for
counting the characters.

> HTH
> Tilman

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/