Re: unicode

Guest section DW (dwguest@win.tue.nl)
Tue, 19 May 1998 10:26:18 +0200 (MET DST)


From tytso@mit.edu Tue May 19 06:46:33 1998

Date: Sat, 16 May 1998 13:04:00 +0200 (MET DST)
From: dwguest@win.tue.nl (Guest section DW)

Just for your entertainment, have you read
POSIX (ISO/IEC 9945-1: 1996) B.2.3.4 (5)?
(Don't be afraid, it is not a prescription; it is just a
discussion about common usage, where it is remarked that
many Unix systems use filenames in several character sets,
and that sometimes even a single filename uses several character sets.)

Yes, and in B.2.2.2, lines 1024--1030, it states that use of character
sets beyond "the portable character set or ISO/IEC 646" is "common", but
"technically noncompliant".

That is a biased and misleading way of quoting. What it actually says is:

"Situations where characters beyond the portable filename character set
(or historically ASCII or ISO/IEC 646) would be used are expected to be
common. Although such a situation renders the use technically noncompliant,
mutual agreement between the users of an extended character set will
make such a use portable between those users. Such a mutual agreement
could be formalized as an optional extension to POSIX.1."

That is, there is a very positive attitude towards greater freedom
in the use of character sets. Understandably so. First of all,
there is no need to make the user's life difficult by imposing
restrictions: nothing is gained and a lot is lost. Secondly,
this portable character set allows only letters, digits and .-_
so that `lost+found' or `[' or `file.c,v' etc. are noncompliant.
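
As a rough illustration of the point about the portable filename
character set, here is a minimal Python sketch that checks names against
letters, digits and .-_ only. The helper name `is_portable_filename` and
the sample list are made up for this illustration. The extra rule that a
name must not start with a hyphen is an added assumption based on the
POSIX definition of portable filenames; it is not taken from the passage
quoted above.

```python
import string

# Portable filename character set as described above:
# letters, digits, period, hyphen and underscore.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_filename(name: str) -> bool:
    """Return True if `name` uses only portable filename characters.

    Hypothetical helper, for illustration only.
    """
    if not name or name.startswith("-"):
        # Assumption: empty names and names with a leading hyphen are rejected.
        return False
    return all(ch in PORTABLE_CHARS for ch in name)


if __name__ == "__main__":
    for name in ["lost+found", "[", "file.c,v", "backup~", "Makefile"]:
        verdict = "portable" if is_portable_filename(name) else "NOT portable"
        print(f"{name!r}: {verdict}")
```

Under these assumptions every everyday name mentioned in this message
fails the check, while a plain name such as `Makefile` passes.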

[Americans tend to underestimate the enormous cost in time
and money of a conversion. Every American would consider
a proposal to convert all filenames to EBCDIC ridiculous,
just impossible, but now that ASCII and UTF-8 happen to
coincide and Americans can convert for free, they talk
easily about the horrors they plan to inflict on the rest
of the world. Fortunately, for the time being, these plans
look like empty words.]

I'm certainly willing to allocate a bit in the directory entry to help
deal with the conversion issues with folks who have been using the
POSIX.1 non-compliant approach of just storing high-eight-bit characters
in their ext2 filesystems, so that we can distinguish between entries
where folks used the non-compliant-but-expedient approach of just using
their local character set, from directory entries using UTF-8 to encode
ISO/IEC 10646 characters.

Are you going to rename `lost+found'? And change all shell scripts
containing `['? Forbid `backup~'?
Or are you going to allocate another bit in a directory entry
for those people who have just been storing characters outside
the portable character set in their filenames?
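
To see why a per-entry flag comes up at all, here is a small Python
sketch. It assumes the local character set is ISO 8859-1, which the
thread does not actually name, and it uses a hypothetical helper
`could_be_utf8` and an invented sample name. It suggests that a validity
check can rule out UTF-8 for some legacy names, but that a valid UTF-8
byte string is also a perfectly legal ISO 8859-1 byte string, so the
bytes alone do not say which one was meant.

```python
def could_be_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same name stored two ways; "caf\u00e9" is "cafe" with an acute accent.
latin1_name = "caf\u00e9".encode("latin-1")  # legacy 8-bit form
utf8_name = "caf\u00e9".encode("utf-8")      # UTF-8 form

# The UTF-8 bytes, misread as ISO 8859-1, still decode without error.
misread = utf8_name.decode("latin-1")

if __name__ == "__main__":
    print(latin1_name, "could be UTF-8:", could_be_utf8(latin1_name))
    print(utf8_name, "could be UTF-8:", could_be_utf8(utf8_name))
    print("UTF-8 bytes read as ISO 8859-1:", repr(misread))
```

A heuristic like this can only say "not UTF-8" with confidence; it cannot
confirm what the user intended.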

As far as I know, people who are doing this today aren't labeling their
filesystems;

Of course not. There is no uniform character set per filesystem.
Every user chooses what fits her best. And my Spanish left neighbour
has wishes rather unlike those of my Russian right neighbour.
They use the same filesystem.

they are just using some local character set. They are
certainly not storing multiple character sets in a single filename,
because there's no way to distinguish which character set to use,

Learn about ISO 2022.

and any such labelling scheme is certainly non-standard.

ISO 2022 is a standard.
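
A concrete, if simplified, illustration of ISO 2022 style switching:
ISO-2022-JP, one profile of ISO 2022, marks each change of character set
inside a byte string with an escape sequence. The sketch below uses
Python's built-in `iso2022_jp` codec. The filename is an invented
example, and nothing here claims that any filesystem interprets names
this way.

```python
ESC = b"\x1b"

# "readme-" followed by three kanji and ".txt" (invented example name).
name = "readme-\u65e5\u672c\u8a9e.txt"
encoded = name.encode("iso2022_jp")

# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII.
to_kanji = ESC + b"$B"
to_ascii = ESC + b"(B"

if __name__ == "__main__":
    print(encoded)
    print("switch to JIS X 0208 at byte", encoded.index(to_kanji))
    print("switch back to ASCII at byte", encoded.index(to_ascii))
    print("round trip ok:", encoded.decode("iso2022_jp") == name)
```

The single name carries two character sets, and the escape sequences
inside it are what tell a reader which set applies to which bytes.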

And note that no standard requires you to cause these troubles.
I will believe that you are serious about strict POSIX
compliance only when you change `lost+found' in
the e2fs utilities, and encourage everybody to fix their RCS
to avoid commas in filenames.
