Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Peter Holzer (hjp@wsr.ac.at)
Wed, 20 Aug 1997 17:24:07 +0200 (MESZ)


Alex Belits wrote:
>But if the only alternatives will be UTF-8 or "no translation at all",
>that will leave only UTF-8 usable -- taking plain ASCII filename in the
>form how it's stored on NTFS (16-bit Unicode) produces a string,
>unsuitable for any string processing. IMHO if one wants to support such a
>thing, replaceable name-translation interfaces should be used, not
>hardcoded UTF-8.

Ok. "No translation at all" is certainly not an option for any file
system where '\0' and '/' are not represented by the single bytes 0x00
and 0x2F.
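
To illustrate the problem, here is a minimal sketch in Python. The
helper split_path() is hypothetical; it only mimics code that treats a
name as a C string (stop at the first 0 byte) and then splits it at
'/' bytes. It is not taken from any kernel or libc.

```python
def split_path(raw: bytes) -> list[bytes]:
    """Mimic byte-oriented path handling: cut at the first NUL, split at '/'."""
    c_string = raw.split(b"\0", 1)[0]  # a C string ends at the first 0 byte
    return c_string.split(b"/")        # 0x2F is taken as the separator


name = "readme.txt"

# Untranslated 16-bit Unicode: every ASCII character carries a 0 byte,
# so the name is cut off after its first character.
as_utf16 = split_path(name.encode("utf-16-le"))

# UTF-8: plain ASCII names keep their one-byte-per-character form.
as_utf8 = split_path(name.encode("utf-8"))

# U+012F is stored as the bytes 2F 01 in UTF-16LE, so a single
# character is mistaken for a directory separator.
tricky = split_path("\u012f".encode("utf-16-le"))
```

With the raw 16-bit form the plain ASCII name is truncated and a
single non-ASCII character is split in two; with UTF-8 the name comes
through intact.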

Here are my thoughts on how this should be handled (without any references
to current Linux code; it has been a long time since I actually looked at
the Linux filesystem code).

There should be some translation between the file system and VFS layer.
The file system could use IBM 437, ISO-8859-x, Unicode, Radix-50,
EBCDIC, or whatever encoding the inventor of the file system deemed
appropriate, but the VFS layer should use a single standardized
encoding. This translation would also handle case conversion for file
systems which are case insensitive, and would reject file names which
cannot be represented. This translation may be fixed for some file
systems or configurable with a mount option for others (probably most).
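
A rough sketch of this first translation layer, in Python rather than
kernel C. The names fs_to_vfs() and vfs_to_fs() and the parameters
fs_encoding and fold_case are made up for illustration (they stand in
for what a mount option might select), and UTF-8 as the VFS form is an
assumption here.

```python
def fs_to_vfs(raw: bytes, fs_encoding: str, fold_case: bool = False) -> bytes:
    """Translate an on-disk name into the (assumed) UTF-8 VFS form."""
    name = raw.decode(fs_encoding)  # raises UnicodeDecodeError on bad input
    if fold_case:
        name = name.lower()         # one canonical case for case-insensitive fs
    if "\0" in name or "/" in name:
        raise ValueError("name contains a reserved character")
    return name.encode("utf-8")


def vfs_to_fs(vfs_name: bytes, fs_encoding: str) -> bytes:
    """Translate a VFS name back to the on-disk encoding, or reject it."""
    name = vfs_name.decode("utf-8")
    try:
        return name.encode(fs_encoding)
    except UnicodeEncodeError as exc:
        # The file system cannot represent this name at all.
        raise ValueError("name not representable on this file system") from exc


# A DOS-style name stored in IBM 437 on a case-insensitive file system.
dos_name = fs_to_vfs(b"\x9aBER.TXT", "cp437", fold_case=True)

# An ASCII name stored as 16-bit Unicode.
wide_name = fs_to_vfs("a.txt".encode("utf-16-le"), "utf-16-le")
```

Trying to create a Cyrillic name on the IBM 437 file system, for
example vfs_to_fs("кот".encode("utf-8"), "cp437"), raises ValueError
in this sketch, which corresponds to rejecting the name.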

There should be another translation layer between the VFS and the
application programs. This one should be based on the locale, because
one user might prefer Latin-1, another KOI8-R, a third ISO-2022-JP and
the fourth UTF-8. Therefore, I think this conversion belongs in libc.
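
The second layer could look roughly like this. Again this is only a
sketch: vfs_to_locale() and locale_to_vfs() are hypothetical names for
what wrappers around readdir() and open() might do, the locale's
encoding is passed in explicitly instead of being read from the
environment, and UTF-8 as the VFS form is an assumption.

```python
VFS_ENCODING = "utf-8"  # assumption: the VFS form suggested in this post


def vfs_to_locale(vfs_name: bytes, locale_encoding: str) -> bytes:
    """Convert a name coming from the kernel into the user's encoding."""
    return vfs_name.decode(VFS_ENCODING).encode(locale_encoding)


def locale_to_vfs(user_name: bytes, locale_encoding: str) -> bytes:
    """Convert a name typed by the user into the VFS encoding."""
    return user_name.decode(locale_encoding).encode(VFS_ENCODING)


vfs_name = "файл.txt".encode(VFS_ENCODING)

# A user with a KOI8-R locale sees one byte per character ...
koi_name = vfs_to_locale(vfs_name, "koi8_r")

# ... and what he types maps back to exactly the same VFS name.
round_trip = locale_to_vfs(koi_name, "koi8_r")
```

A user with a Latin-1 locale cannot see this name: in the sketch
vfs_to_locale(vfs_name, "latin-1") raises UnicodeEncodeError. What
libc should do in that case (report an error, substitute an escape)
is left open.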

Note that both the VFS encoding and the encoding in the file system are
almost completely hidden from the user (he will just have problems when
either a file name contains a character not in his current locale, or
when he tries to create a file with a character not representable on the
file system), and that the user's locale is completely hidden from the
file system. The representation used by the VFS layer is completely
irrelevant to either of them as long as it can represent all characters
of all locales and filesystems. For compatibility with old libc's it
should preserve the semantics of '\0' and '/'. I think UTF-8 is a good
choice for this.
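
The property that matters here can be checked by brute force. The
function name below is made up; the loop encodes every Unicode code
point (skipping surrogates, which cannot be encoded) and verifies that
the bytes 0x00 and 0x2F never occur inside a multi-byte UTF-8
sequence.

```python
def utf8_preserves_nul_and_slash() -> bool:
    """Check that 0x00 and 0x2F appear only as the encodings of NUL and '/'."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points have no UTF-8 encoding
        encoded = chr(cp).encode("utf-8")
        if len(encoded) > 1 and (0x00 in encoded or 0x2F in encoded):
            return False
    return True


result = utf8_preserves_nul_and_slash()
```

So code which only looks for '\0' and '/' bytes keeps working on UTF-8
names without knowing anything about the encoding.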

hp

--
   _  | Peter J. Holzer             | If I were God, or better yet
|_|_) | Sysadmin WSR                | Linus, I would ...
| |   | hjp@wsr.ac.at               |     -- Bill Davidsen
__/   | http://wsrx.wsr.ac.at/~hjp/ |        (davidsen@tmr.com)