On Tue, Apr 29, 2008 at 10:29:11AM +0300, Adrian Bunk wrote:
So don't use that particular tool, and/or file a bug with the maintainer. :-)
On Tue, Apr 29, 2008 at 07:06:05AM +0200, Willy Tarreau wrote:
On Mon, Apr 28, 2008 at 06:29:43PM -0700, H. Peter Anvin wrote:
Non-ancient distributions default to UTF-8 and have tools that handle it fine.
Willy Tarreau wrote:
Is this really needed Adrian? I mean, everyone reads iso-8859-1, not
everyone reads UTF-8.
"Everyone" who speaks a Western European language, perhaps; and even then, mostly because a lot of tools still have an "oh, it's not valid UTF-8, guess iso-8859-1" mode.
Or simply because people have not migrated all their installs, or have
explicitly disabled UTF-8 a few hours after starting to use it once
they discovered the mess it caused and the poor support from the
tools :-/
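The "it's not valid UTF-8, guess iso-8859-1" mode mentioned above can be sketched roughly as follows. This is only an illustration of the general idea, not the behaviour of any particular tool; the function name decode_with_fallback is made up for the example.

```python
def decode_with_fallback(raw):
    """Return (text, encoding_used), trying strict UTF-8 first.

    Hypothetical helper, not taken from any real tool.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a code point to every byte value, so this
        # decode cannot fail. That is what makes it usable as a last-resort
        # guess, and also why a wrong guess goes unnoticed.
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    print(decode_with_fallback("Jörn".encode("utf-8")))
    print(decode_with_fallback(b"J\xf6rn"))  # the same name in ISO-8859-1 bytes
```

Because the fallback never raises, text in any other 8-bit charset is silently shown as ISO-8859-1, which is the reason "everyone reads iso-8859-1" only appears to hold.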
If you had bad experiences in the last millennium you should try again.
Well, I accidentally used a freshly installed laptop running Mandriva 2008.
I was typing in a terminal inside KDE (I don't know the program name, sort
of an xterm, but with huge borders all around). I made a typo in a word and
typed in an "é" (e acute). Pressing backspace to fix it showed me that I
removed more chars than typed. I tried again: pressing this letter 5 times,
then backspace 10 times, I removed 5 chars from the prompt. I suspect that
if I had used some chars with wider encoding (eg 4 bytes), I could have
removed as many... Clearly those tools are not ready.
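One plausible reading of the backspace report above is that the line editing worked on bytes while the characters typed were two-byte UTF-8 sequences. The sketch below only illustrates that mismatch; it is not the code of any terminal or of the kernel tty layer, and all function names are invented for the example.

```python
def erase_bytewise(buf):
    """Drop one byte, as an editor unaware of UTF-8 would."""
    return buf[:-1]


def erase_utf8(buf):
    """Drop one whole UTF-8 character from the end of buf."""
    end = len(buf)
    # Skip trailing continuation bytes (bit pattern 10xxxxxx) ...
    while end > 0 and (buf[end - 1] & 0xC0) == 0x80:
        end -= 1
    # ... then drop the lead byte (or the single ASCII byte) as well.
    return buf[:max(end - 1, 0)]


def backspaces_to_clear(buf, erase):
    """Count how many backspaces are needed to empty buf with a given erase rule."""
    count = 0
    while buf:
        buf = erase(buf)
        count += 1
    return count


if __name__ == "__main__":
    typed = "é".encode("utf-8") * 5                    # 5 characters, 10 bytes
    print(backspaces_to_clear(typed, erase_bytewise))  # 10
    print(backspaces_to_clear(typed, erase_utf8))      # 5
```

If each byte-wise erase also wipes one screen column, ten backspaces wipe ten columns for five visible characters, which would match the five prompt characters that disappeared.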
Also, I recently upgraded one machine from 2.6.22 to 2.6.25. Same crappy
behaviour on the console (with bash). I quickly set the vt.defaults on
the kernel command line to fix the problem.
Outside the English-speaking world, userland _was_ completely
At this stage, I'm not even trying to "fix" the problem, as it's
a philosophical debate and I do not want to enter it. Some people
consider it normal that we break user-space applications and that
it's obvious that all userland code has to be replaced to remain
compatible with "evolutions", and I simply do not support this
principle.
I just care about having the ability to disable the
broken behaviour. Most of the problem comes from the variable
length characters causing wrapping lines and misplaced tabs when
read in non UTF-8 aware editors and/or terminals.
Consider the alternative - disable the broken behavior by using a
It has been done for years because there was no other choice. If you
And do we really consider that people's names in *comments* cannot
be converted to pure ASCII? I'm western european and have always
been against accents in comments (another reason to write comments
in english BTW).
Accents are very rare in names in the kernel.
Most non-ASCII characters are umlauts and there's no sane way to express them in ASCII (and the vowels without umlaut are pronounced quite differently and might even make names look very strange).
And that's only within European languages, outside it becomes even worse.
Agreed, but it's been done for *years*. I received mails from people
whose names were spelled "jorn" or "jurgen" and they had no trouble using
that spelling in their names or mail addresses.
Unix and internet have lived without accents for
almost 30 years without anyone really bothering. And now we try to
put them everywhere (even in domain names, implying big security
issues) and it causes real annoyances. People's names have not
changed in 30 years, so I guess that the rules used during this
time to ASCII-fy the names are still usable.
Lots of people actually bothered - and created various encoding schemes
Such "rules" may work for kernel comments specifically.
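The kind of ASCII-fying rule under discussion ("jorn", "jurgen") can be approximated by decomposing each character and dropping the accent marks. This is a minimal sketch of one such rule, not a convention anyone in the thread prescribes; asciify is a made-up name, and "Søren" is a name added here only to show where the rule stops working.

```python
import unicodedata


def asciify(name):
    """Strip accents by Unicode decomposition; anything left over becomes '?'."""
    decomposed = unicodedata.normalize("NFKD", name)
    # After decomposition, accents are separate combining characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter (e.g. "ø") cannot be mapped this way.
    return stripped.encode("ascii", "replace").decode("ascii")


if __name__ == "__main__":
    print(asciify("Jörn"))    # Jorn
    print(asciify("Jürgen"))  # Jurgen
    print(asciify("Søren"))   # S?ren
```

The rule is lossy by design: it yields the spellings mentioned above for umlauts, but it neither produces the "oe"/"ue" transliterations some people prefer nor handles letters that have no decomposition.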
And all those that actually used those "different charsets" disagree,
The comments in the kernel were converted to UTF-8 quite some time ago; what I'm fixing with my patch is just some recent non-UTF-8 stuff that crept in.
Well, if that had already begun, at least you're standardizing.
And names in comments in the kernel were not pure ASCII since very early, they were in other charsets.
Mostly iso-8859-1, but not all of them.
I remember that for one name we first guessed which character it was and then tried to figure out which charset it was in (no, it was not one of iso-8859-*).
So it was not "ASCII -> UTF-8", it was
"several different charsets -> UTF-8".
I would have loved to see "several different charsets -> ASCII".
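The "guess the character first, then work out the charset" procedure described above can be sketched as a search over candidate encodings. This is an assumption about how such a check might be done by hand today, not a description of what was actually used; the function name and the candidate list are illustrative only.

```python
def charsets_decoding_to(raw, expected, candidates):
    """Return the candidate encodings under which raw decodes to expected."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown codec name.
            continue
    return matches


if __name__ == "__main__":
    candidates = ["utf-8", "iso-8859-1", "koi8-r"]
    print(charsets_decoding_to(b"\xe9", "é", candidates))  # ['iso-8859-1']
    print(charsets_decoding_to(b"\xe9", "И", candidates))  # ['koi8-r']
```

The same byte yields different letters in different charsets, so the conversion to UTF-8 needs a decision per file (or per name) about the source charset; the bytes alone do not settle it.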