Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 26 Aug 1997 23:17:22 -0700 (PDT)


On Wed, 27 Aug 1997, Michael Poole wrote:

> > anyway without user-space help.
>
> ext2 currently supports UTF-8 as an encoding; unless you have
> characters outside the ASCII range, it's also using UTF-8, since it was
> designed to preserve that range.

This is rather backward... UTF-8 is designed to allow the use of
Unicode filenames, which in a 16-bit encoding can contain zero or slash
bytes, in filesystems that don't allow those bytes. It's like claiming
"ext2 already supports BASE64 encoding" or "Red Square already supports
M1 tanks, why not place them there?"
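
To make the point about zero and slash bytes concrete, here is a minimal
Python sketch. The helper name unsafe_bytes is made up for illustration,
and "utf-16-be" merely stands in for a 16-bit Unicode encoding; nothing
here is kernel or ext2 code.

```python
def unsafe_bytes(name, encoding):
    """Return the bytes of the encoded name that a Unix filesystem
    cannot store inside one path component: NUL (0x00) and '/' (0x2F)."""
    return [b for b in name.encode(encoding) if b in (0x00, 0x2F)]


# U+012F is picked only because the low byte of its code point is 0x2F.
name = "\u012f"

print(name.encode("utf-16-be").hex())   # 012f: second byte is a slash
print(name.encode("utf-8").hex())       # c4af: no NUL, no slash

print(unsafe_bytes(name, "utf-16-be"))  # [47]
print(unsafe_bytes(name, "utf-8"))      # []
print(unsafe_bytes("A", "utf-16-be"))   # [0]: even plain ASCII gets a NUL
```

In UTF-8 every byte of a multi-byte sequence has its high bit set, so the
only way to get 0x00 or 0x2F is to encode U+0000 or "/" itself.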

> For filenames, as long as we don't want
> the kernel to ensure that only an integral number of variable-width
> characters are stored, I agree with you: the kernel doesn't need to know
> about the external encoding, and shouldn't know about it.
> However, my personal belief is that there should be a policy in
> the kernel to only allow whole characters to be stored;

What if it's not characters but, say, phone numbers as filenames? Those
are also never supposed to be truncated. Words, like "this-is-my-file"?
Should it be disallowed to truncate "14/v\0|~5p33|<" as "14/v\0|~5p33|"
because it's the word "lamerspeak" in the "language" of the same name, and
truncating it breaks the letter "K"? IMNSHO the kernel should not handle
such things.

> in this case the
> kernel will need to know what encoding is used for file names. I strongly
> suspect that this won't be implemented, though, due to either standards
> compliance or for the benefit of supporting multiple encodings. There is
> a very strong case to be made that character delineation should be left to
> user-space, and if that's what prevails, so be it; I think that libc
> should be able to implement the policy just as well as the kernel.

Even libc is not the right place in a lot of situations, but the kernel
is definitely the most wrong one.
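
For what it's worth, a program that does know its names are UTF-8 can
keep characters whole by itself. A minimal Python sketch, assuming valid
UTF-8 input and a non-negative limit; truncate_utf8 is a hypothetical
helper, not a libc or kernel interface:

```python
def truncate_utf8(name, max_bytes):
    """Cut a UTF-8 byte string to at most max_bytes bytes without
    splitting a multi-byte character. Assumes name is valid UTF-8."""
    if len(name) <= max_bytes:
        return name
    cut = max_bytes
    # Continuation bytes have the form 10xxxxxx. If the first byte that
    # would be dropped is one of them, the cut falls inside a character,
    # so back up to the lead byte of that character.
    while cut > 0 and (name[cut] & 0xC0) == 0x80:
        cut -= 1
    return name[:cut]


raw = "na\u00efve".encode("utf-8")   # 6e 61 c3 af 76 65
print(truncate_utf8(raw, 3))         # b'na': the 2-byte character is dropped whole
print(truncate_utf8(raw, 4))         # b'na\xc3\xaf'
```

Note that this only works because the caller knows the encoding, which
is exactly what the kernel does not know.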

--
Alex