UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Teunis Peters (teunis@usa.net)
Mon, 18 Aug 1997 21:20:43 -0600 (MDT)


On Tue, 12 Aug 1997, Andrew E. Mileski wrote:

> > > The OSTA-UDF(tm) filesystem I'm working on supports compressed Unicode.
> > > Basically, the first byte is a flag indicating how to expand the following
> > > bytes:
> > > 8 = high byte is 0 and low byte is from data stream
> > > 16 = high byte is followed by low byte in the data stream
> > > By the ISO standards, this is CS0 or a character set defined by agreement.
> >
> > Not yet another Unicode encoding format?! What's wrong with
> > UTF-8? Not Invented Here?
>
> UTF-8 maps Unicode to a font as Unicode does not specify how a character
> appears, but rather Unicode differentiates characters from each other.

<pardon delay - was away fer a week... so don't mind if this is the
umpteenth answer>

Unicode : A set of encoding standards for storing international
characters, yes?

UTF-7 : a 7-bit way of encoding Unicode (and _ONLY_ Unicode)
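UTF-8 : an 8-bit one (variable length, one or more bytes per character)
UTF-16: a 16-bit one (one or two 16-bit units per character)
UTF-32: a 32-bit one, covering the full Unicode range (AFAIK)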

The full code space is split into tables ("planes") of 64K characters each.
What's commonly encountered is only the first table of Unicode characters
(this is NOT the same thing as UTF-16).

AFAIK only 3 tables have been defined - the basic Unicode set (64K) and
two tables for such large symbolic languages as Chinese....

If you want bit encodings for UTF-8 I can provide them (but
http://www.unicode.org is a better place to look :)
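
For the curious, here is a rough sketch of those bit encodings in Python.
The function name utf8_encode is made up for illustration, and it follows
the current 1-to-4-byte definition of UTF-8 (the older definition allowed
sequences of up to 6 bytes). Treat it as a sketch, not a reference
implementation:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x80:
        # 0xxxxxxx : plain ASCII, one byte
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

The lead byte says how many bytes follow, and every continuation byte
starts with the bits 10, so ASCII text passes through unchanged.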

There is NO font information anywhere in any of this mess.... Just a
standard 'this character == this Unicode value'....

Beyond the fact that the Chinese still (AFAIK) haven't decided whether or
not to actually USE Unicode [the language has other ways of creating new
characters - this is not something computers are good at handling], Unicode
has largely been accepted [mostly by fiat].

Personally I think Unicode is a really good idea... I like the idea of
being able to give files descriptive filenames.
Sometimes the native language [e.g. Japanese] is the only way to describe a
file.

Not that it matters, but I think filenames from 16bit+ filesystems should
be encoded into UTF-8 before being passed to the user.
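
As a rough illustration, here is how that could look for the compressed
Unicode scheme quoted above (first byte 8 or 16, then the data). This is a
Python sketch based only on that description. The function names are
invented for the example, and treating each 16-bit unit as one character
is my assumption:

```python
def expand_compressed_unicode(raw: bytes) -> str:
    """Expand a name whose first byte is the flag 8 or 16 (illustrative)."""
    if not raw:
        raise ValueError("empty name")
    flag, data = raw[0], raw[1:]
    if flag == 8:
        # High byte is 0, low byte comes from the data stream.
        return "".join(chr(b) for b in data)
    if flag == 16:
        # High byte is followed by low byte in the data stream.
        if len(data) % 2:
            raise ValueError("odd number of bytes in 16-bit name")
        return "".join(chr((data[i] << 8) | data[i + 1])
                       for i in range(0, len(data), 2))
    raise ValueError("unknown flag byte: %d" % flag)


def filename_for_user(raw: bytes) -> bytes:
    """Hand the expanded name to user space as UTF-8."""
    return expand_compressed_unicode(raw).encode("utf-8")
```

For example, filename_for_user(b"\x08abc") gives back plain b"abc", so
ASCII-only names look exactly the same to the user as before.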

So what filesystems are dependent on what character set?

FAT : 8-bit IBM-PC
VFAT : 16-bit Unicode
ext-2 : Latin-1? (though UTF-8 is supported)
HPFS : global translation table (8bit -> 16bit Unicode?)
NTFS : as HPFS I think

This is all I know of....

Sure would be nice to be able to emulate ioctl's on existing devices BTW -
being able to COMPLETELY emulate a console could be valuable <g>....
[I have a graphical console that emulates everything except ioctl's...]
(though 'kon' is faster <sigh>)