Re: UTF-8 and case-insensitivity

From: tridge
Date: Tue Feb 17 2004 - 23:14:37 EST


Hpa,

> So you're hosed if anyone uses characters outside the UCS-2 character
> set...

I've heard they are re-defining all those 16 bit numbers to be UCS-16
instead of UCS-2 for exactly that reason. This is rather similar to
the move in the Unix community to start using UTF-8.

Note that I am not at all proposing that we use UCS-2 in the Linux
kernel (except in places where you have to, like the NTFS
filesystem). I am proposing that the filesystems be able to offer a
case-insenstive hash function to the dcache, and I would expect that
this function would be based on UTF-8.

The function might operate internally by converting UTF-8 to UCS-2, or
it might use a sparse mapping table. It would almost certainly have a
fast-path that looked first to see if there are any bytes with the top
bit set, and if there are none then it can do a really easy 7 bit
table based hash which would make this really fast for most users.

The point is that the kernel proper (the VFS and dcache in particular)
won't have to care how this hash works. They're just consumers of it.

> There is a "standard" table, which is published by the Unicode
> consortium.

The table used in windows is not exactly the same as the one on
unicode.org. Which is "correct" I will leave up to the pedants to
discuss, as all that Samba cares about is that it uses the same table
as w2k.

> However, the "standard" table isn't what you want in certain
> locales, e.g. Turkish.

I'd really like someone to confirm this for me by volunteering to run
a tool I provide on a Turkish NTFS filesystem or sending me a
compressed empty Turkish NTFS volume (please ask first by email - I
only need one of these). Up to now I have only ever seen the one 128k
table used across all windows locales. If this table really *is*
different in some locales then I need to know.

Cheers, Tridge
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/