Re: unicode (char as abstract data type)

Richard B. Johnson (root@chaos.analogic.com)
Tue, 21 Apr 1998 15:48:47 -0400 (EDT)


On Tue, 21 Apr 1998, Steve VanDevender wrote:
[SNIPPED]
>
> Unicode is a _character set_. That is, it is a set of numeric
> encodings for a set of symbols used in writing nearly all the
> languages in the world today.
>
[SNIPPED]
> It also means that
> low-level applications (like the Linux kernel) can store any of
> these characters and leave the hairy aspects of language-specific
> interpretation to applications. If you want to do language
> tagging in the Linux kernel itself, just so you can continue to
> use your beloved koi-8 character set, you're introducing a huge
> amount of potential bloat that doesn't belong there.

What we need in an Operating System is a method of storing
and retrieving information. How that information is obtained
and used is not, and must not be, specific to an Operating
System.

Efficient Operating Systems store and retrieve information
in the natural message (information) units of the hardware.
This means that efficient Operating Systems are not portable.
Portability exists at the Applications Programming Interface.

Translation of information to and from these natural message
units has traditionally been the domain of interface programs,
including Applications, Databases, and Shells.

Recent Operating Systems have attempted to put "Human
readable" information into file-systems. Since file-systems
exist within Operating Systems, problems arise in determining
what "Human readable" really means. If file-systems never had
such text in their file-names, this would not be a problem.

Early file-systems used integers to identify files. File
readers, such as database and directory programs, would
translate the information contained within such files into
"Human readable" form. This was the idea behind the
"Container File". The fact that ASCII was used as the last
link to the human interface is irrelevant. If a particular
language needs 128-bit characters (message units), a
container file for that language would be 16 times larger
than one that only needed 8-bit characters (128 / 8 = 16).
Nothing else would change.
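
A purely hypothetical sketch of such a container-file directory
entry (the field names and sizes are illustrative, not taken from
any real file-system): the Operating System cares only about the
integer id, while the name bytes are opaque payload that a
directory-listing program decodes however it likes.

    #include <stdint.h>
    #include <stdio.h>

    struct dir_entry {
            uint32_t id;        /* all the OS needs to know      */
            uint8_t  name[32];  /* opaque bytes for applications */
    };

    static const uint8_t *lookup_name(const struct dir_entry *tab,
                                      size_t n, uint32_t id)
    {
            size_t i;

            for (i = 0; i < n; i++)
                    if (tab[i].id == id)
                            return tab[i].name;
            return NULL;
    }

    int main(void)
    {
            struct dir_entry tab[2] = {
                    { 17, "readme.txt" },
                    { 42, "notes" },
            };

            /* Only the reader program decides how to show it. */
            printf("file 42 is shown as \"%s\"\n",
                   (const char *)lookup_name(tab, 2, 42));
            return 0;
    }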

If it requires N bits per message unit to handle translations
to and from various Human Languages, these bits should not
mean anything to an Operating System. They should just be part
of its natural data stream.

If your User Interface language of choice can communicate
using only, say, X bits, the Operating System should not
have to process N bits per message unit just to extract the
X bits you want. Instead, the Operating System should only
have to process the information necessary to save or restore
those X bits.
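
This is, in fact, how Linux already treats file-names: to the
kernel a name is an opaque byte sequence in which only '/' and
the terminating NUL are special. A small illustration (a sketch
of the idea, not a proposed interface):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            /*
             * Twelve opaque bytes to the kernel; they happen to
             * be a non-ASCII name encoded in UTF-8, but ASCII or
             * KOI8-R bytes would be stored just the same.
             */
            const char *name =
                "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82";
            int fd;

            fd = open(name, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0)
                    close(fd);
            return 0;
    }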

Attempts to put "Human readable" stuff into an Operating
System will eventually fail. The correct place for such
an operation is within the application.

That said, the translation interface may well be part of
a shared library or similar API, so that translation becomes
essentially transparent even at the Application Level.
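
For example, an application (or a library it links against) can
already do a KOI8-R to UTF-8 translation entirely in user space
with iconv(3), never involving the kernel. A minimal sketch,
assuming an iconv implementation such as glibc's is available:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* A short Russian word encoded in KOI8-R. */
            char koi8[] = "\xf0\xd2\xc9\xd7\xc5\xd4";
            char utf8[64];
            char *in = koi8, *out = utf8;
            size_t inleft = strlen(koi8);
            size_t outleft = sizeof(utf8) - 1;
            iconv_t cd;

            cd = iconv_open("UTF-8", "KOI8-R");
            if (cd == (iconv_t) -1) {
                    perror("iconv_open");
                    return 1;
            }
            if (iconv(cd, &in, &inleft, &out, &outleft)
                == (size_t) -1)
                    perror("iconv");
            *out = '\0';          /* terminate the converted text */
            printf("%s\n", utf8);
            iconv_close(cd);
            return 0;
    }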

Unicode only goes part of the way toward fixing a problem that
should never have occurred in the first place. As Operating
Systems mature, we should not lose track of their essential
functions and certainly should not attempt to make them
"human".

Cheers,
Dick Johnson
***** FILE SYSTEM MODIFIED *****
Penguin : Linux version 2.1.92 on an i586 machine (66.15 BogoMips).
Warning : It's hard to remain at the trailing edge of technology.
