Re: unicode (char as abstract data type)

Theodore Y. Ts'o (tytso@MIT.EDU)
Fri, 17 Apr 1998 17:43:00 -0400


From: alan@lxorguk.ukuu.org.uk (Alan Cox)
Date: Fri, 17 Apr 1998 20:42:51 +0100 (BST)

> UNICODE is more than just irritating. The problem is that the programming
> language thinks in terms of char* text. You start using wchar_t and before
> you know it, you have a huge mess and you just can't seem to get the types
> quite right anymore.

That is why UTF8 is the right format to use in real situations. UTF8
works just like ASCII as far as memory handling goes - it's just that
x++ no longer always moves on by one char, and strlen(x) isn't the right
answer for the number of characters.
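
To make the x++ and strlen(x) point concrete, here is a minimal C sketch. The helper names utf8_count and utf8_next are hypothetical (they are not from any library), and the code assumes its input is well-formed, NUL-terminated UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a well-formed,
 * NUL-terminated UTF-8 string. Every byte that is NOT a
 * continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: the UTF-8 replacement for x++.
 * Steps over one lead byte and any continuation bytes after it.
 * Returns s unchanged when s already points at the terminator. */
const char *utf8_next(const char *s)
{
    if (*s != '\0')
        s++;
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++;
    return s;
}
```

For the two-byte sequence "\xC3\xBC" (u with an umlaut), strlen() reports 2 bytes while utf8_count() reports 1 code point. Note that this counts code points, not what a user would see as characters; a u followed by a combining umlaut still counts as 2.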

There's one problem with UTF-8 (or rather Unicode in general), which
recently cropped up when people tried to use it in X.509 certificates:
canonical encodings.

For example, there are (at least) two ways to encode a u with an umlaut
character: as a single precomposed symbol, or as a plain u followed by a
combining umlaut. This means that if someone creates a filename using the
one-symbol version of u with an umlaut, and someone else tries to do a
lookup using the two-symbol version, a byte-for-byte comparison won't
find the file. There are ways of creating a canonical encoding using the
characters in their fully decomposed form. However, there appears to be
no algorithmic way to do this. You have to have large tables to
canonicalize Unicode, and the tables will change over time as new
characters get added to Unicode.
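
As a rough illustration of the canonicalization problem, the C sketch below spells u with an umlaut both ways in UTF-8 and decomposes one into the other using a lookup table. Everything named here (U_UMLAUT_COMPOSED, U_UMLAUT_DECOMPOSED, toy_table, toy_decompose) is hypothetical, and the two-entry table is only a stand-in for the large tables a real canonicalizer would need; it is not a complete or standards-conforming implementation.

```c
#include <string.h>

/* U+00FC, u with an umlaut as one precomposed code point. */
static const char U_UMLAUT_COMPOSED[] = "\xC3\xBC";
/* U+0075 'u' followed by U+0308 COMBINING DIAERESIS. */
static const char U_UMLAUT_DECOMPOSED[] = "u\xCC\x88";

/* Toy decomposition table: an assumption for illustration only.
 * A real table has thousands of entries and grows over time. */
struct decomp {
    const char *composed;
    const char *decomposed;
};

static const struct decomp toy_table[] = {
    { "\xC3\xBC", "u\xCC\x88" },  /* u with umlaut */
    { "\xC3\xA9", "e\xCC\x81" },  /* e with acute accent */
};

/* Copy src to dst, replacing every composed sequence found in
 * toy_table with its decomposed form. Returns 0 on success,
 * -1 if dst (of dstlen bytes) is too small. */
int toy_decompose(const char *src, char *dst, size_t dstlen)
{
    size_t out = 0;

    if (dstlen == 0)
        return -1;

    while (*src != '\0') {
        const char *piece = src;   /* default: copy one byte as-is */
        size_t in_len = 1, out_len = 1;
        size_t i;

        for (i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
            size_t n = strlen(toy_table[i].composed);

            if (strncmp(src, toy_table[i].composed, n) == 0) {
                piece = toy_table[i].decomposed;
                in_len = n;
                out_len = strlen(piece);
                break;
            }
        }
        if (out + out_len >= dstlen)   /* keep room for the NUL */
            return -1;
        memcpy(dst + out, piece, out_len);
        out += out_len;
        src += in_len;
    }
    dst[out] = '\0';
    return 0;
}
```

A plain strcmp() of the two spellings reports them as different, which is exactly the failed filename lookup described above. After running both through toy_decompose() they compare equal, but only because the table happens to contain the entry.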

- Ted
