Re: UTF-8 practically vs. theoretically in the VFS API

From: Helge Hafting
Date: Tue Feb 17 2004 - 06:08:13 EST


pcg( Marc)@goof(A.).(Lehmann )com wrote:
On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@xxxxxxxx> wrote:

Try it with a regular C locale. Do a simple

echo > åäö


Just for your info, though. You can't even input these characters in a C
locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
C functions will barf on this one). C is 7 bit only.


Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
program should do when it sees broken UTF-8.


The problem is that the very common C language makes it a pain to use
this in i18n programs. Multibyte functions or iconv will not accept
these, so programs wanting to do what you are expecting to do need to
re-implement most if not all of the character handling of your typical
libc.

Yes, it's possible....

All you need is a possible_garbage_to_properly_escaped_utf8(char *string)
in libc. Any program that wants to display filenames it got
straight from readdir (or any binary file contents) will simply feed
the string through that and get back a string with
escapes for anything that isn't utf8. It is a write-once, use
everywhere thing.
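
A minimal sketch of what such a function could look like. Only the name
comes from the text above; everything else is an assumption made for
illustration: the const parameter, the malloc'ed return value that the
caller frees, the \xHH escape format, and the doubling of literal
backslashes so the output stays unambiguous. The helper utf8_seq_len is
hypothetical and not part of any libc.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the well-formed UTF-8 sequence starting
 * at s (with n bytes available), or 0 if the bytes there are not valid. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */

    if (n < len)
        return 0;                                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000))
        return 0;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    return len;
}

/* Returns a newly allocated string (caller frees), or NULL if malloc fails.
 * Valid UTF-8 is copied through, '\' becomes "\\", and every byte that is
 * not part of a valid sequence becomes "\xHH". */
char *possible_garbage_to_properly_escaped_utf8(const char *string)
{
    const unsigned char *in = (const unsigned char *)string;
    size_t n, i = 0, o = 0;
    char *out;

    assert(string != NULL);
    n = strlen(string);
    out = malloc(4 * n + 1);        /* worst case: every byte -> \xHH */
    if (out == NULL)
        return NULL;

    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);

        if (in[i] == '\\') {
            out[o++] = '\\';
            out[o++] = '\\';
            i++;
        } else if (len > 0) {
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        } else {
            o += (size_t)sprintf(out + o, "\\x%02X", (unsigned)in[i]);
            i++;
        }
    }
    out[o] = '\0';
    return out;
}
```

With this sketch, a Latin-1 name such as the bytes E5 E4 F6 comes back as
the printable text \xE5\xE4\xF6, while a name that is already valid UTF-8
comes back unchanged.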

Once upon a time, there were serious problems when someone created
filenames like "; rm -fr *". Today we use tab completion
and get bash to present the filename with proper escapes. It is then harmless.
Bad utf8 can be handled the same way.

The "bit" is enormous, as you can't use your libc for text processing
anymore.

Not the current libc, but libc can be improved upon. The same happened to
silly code that wasn't 8-bit clean.

Helge Hafting
