Re: UTF-8 and case-insensitivity

From: H. Peter Anvin
Date: Tue Feb 17 2004 - 03:07:00 EST


Followup to: <c0sgnc$ngo$1@xxxxxxxxxxxxxxxxxx>
By author: hpa@xxxxxxxxx (H. Peter Anvin)
In newsgroup: linux.dev.kernel
>
> Realistically, the only sane way to do this is to set our foot down
> and say: UTF-8 is *the* encoding. A good step in that direction would
> be to set utf-8 to be the default NLS in the kernel, but as long as
> people keep the whole sick idea that we can continue to use
> locale-dependent encoding we're in for a world of hurt.
>
> That's really the long and short of it. Until people are willing to
> say "we support UTF-8, anything else and it's anyone's guess what
> happens" then nothing is going to happen.
>

Oh yes, on top of that, if you want case insensitivity, then you also
need to start thinking about a whole lot of other things, including
what normalization form(s) you care about. Keeping normalization (as
well as case-conversion) data for the entire Unicode space in the
kernel takes a boatload of memory.
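
To make the normalization point concrete, here is a minimal userspace
sketch in Python (not kernel code). It shows that the same visible name
can arrive as two different UTF-8 byte strings, and that a
case-insensitive comparison has to pick a normalization form first. The
helper name fold_key and the choice of NFD are assumptions made for
illustration only.

```python
import unicodedata

# The same visible name, encoded two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def fold_key(name: str) -> str:
    """Hypothetical comparison key: normalize, case-fold, normalize again.

    Case folding can produce sequences that are no longer normalized,
    which is why the second normalization pass is there.
    """
    first_pass = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", first_pass.casefold())


# A byte-wise comparison sees two different names.
print(composed.encode("utf-8"), decomposed.encode("utf-8"))

# After normalization and case folding they compare equal.
print(fold_key(composed) == fold_key(decomposed))
```

Every call to fold_key consults the Unicode decomposition and
case-folding tables, which is exactly the data that would have to live
in the kernel.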

Then, you have to deal with your filesystem going sour on you when two
files suddenly alias, because there is a new revision of the mapping
tables.
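
A small sketch of how that aliasing could come about, again in Python.
The two mapping tables are hypothetical stand-ins, not real kernel or
Unicode data files; the second one simply adds a lowercase mapping for
U+1E9E (capital sharp s), a character that was only assigned in a later
Unicode revision. The function names are made up for the example.

```python
def fold(name, table):
    """Fold a name using a per-character mapping table (hypothetical)."""
    return "".join(table.get(ch, ch) for ch in name)


def find_aliases(names, table):
    """Return pairs of names that collide once folded with the table."""
    seen = {}
    clashes = []
    for name in names:
        key = fold(name, table)
        if key in seen:
            clashes.append((seen[key], name))
        else:
            seen[key] = name
    return clashes


# Old revision: no entry for U+1E9E, so it folds to itself.
TABLE_V1 = {"A": "a", "S": "s"}

# New revision: U+1E9E now folds to U+00DF (sharp s).
TABLE_V2 = dict(TABLE_V1)
TABLE_V2["\u1e9e"] = "\u00df"

# Two files that were created while the old table was in use.
directory = ["stra\u00dfe", "stra\u1e9ee"]

print(find_aliases(directory, TABLE_V1))  # no clashes under the old table
print(find_aliases(directory, TABLE_V2))  # the two names now alias
```

Both files were legal and distinct when they were created; only the
table changed underneath them.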

Case seemed simple when we were dealing with the "let's teach them all
English" world, but even when you're dealing with languages like
German (ß) or Dutch (ĳ) things get fuzzy... what's worse, in
Turkish the uppercase equivalent of "i" (U+0069) isn't "I" (U+0049),
it's "İ" (U+0130)! There is no table which can tell you that, since
it's context-dependent. Thus, you may now need to consider larger
equivalence classes, but is the other user expecting the same thing?
You can't just use the same base letter being equivalent everywhere,
or a Swedish user would beat the sh*t out of you for confusing the
words "vas" and "väs". On the other hand, the Swedish user would be
perfectly happy having "æ" equivalent with "ä" and "ü" equivalent
with "y"!
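
The Turkish case can be sketched in a few lines of Python. The built-in
str.upper() applies the default, language-independent mapping, so the
language has to be passed in from outside. The function
upper_for_locale and the TURKIC set are hypothetical names for this
example, and the special-casing shown covers only the dotted "i".

```python
# Languages that uppercase "i" to dotted capital I (assumed set for the sketch).
TURKIC = {"tr", "az"}


def upper_for_locale(text: str, lang: str) -> str:
    """Uppercase text, special-casing "i" for Turkic languages (sketch only)."""
    if lang in TURKIC:
        # "i" (U+0069) becomes "İ" (U+0130) rather than "I" (U+0049).
        text = text.replace("i", "\u0130")
    # Everything else goes through the default mapping; dotless "ı"
    # (U+0131) already uppercases to "I" there.
    return text.upper()


print(upper_for_locale("i", "en"))   # default mapping
print(upper_for_locale("i", "tr"))   # Turkish mapping
print(upper_for_locale("\u0131", "tr"))
```

The same input character gives two different answers, and the only
thing that distinguishes them is information the filesystem does not
have: which language the user had in mind.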

Therein lies madness.

-hpa



--
PGP public key available - finger hpa@xxxxxxxxx
Key fingerprint: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD 1E DF FE 69 EE 35 BD 74
"The earth is but one country, and mankind its citizens." -- Bahá'u'lláh
Just Say No to Morden * The Shadows were defeated -- Babylon 5 is renewed!!