Re: Unicode, etc. solution

Albert Cahalan (acahalan@lynx.dac.neu.edu)
Wed, 27 Aug 1997 07:16:48 -0400 (EDT)


H. Peter Anvin writes:

>> The reason is simple: If the kernel assumes you use Latin-1
>> for both the filesystem and console, it won't mangle anything.
>> This is not exactly the same as ignoring the characterset!
>> It is a nice hack for 1-language 8-bit systems.
>
> Nope, it's not. It's a much better idea to make sure the right
> tables KOI-8 -> Unicode -> font are loaded. The first translation
> of these still needs work.

While having the right tables is better, the other method works.
It is an easy default, which makes it a nice hack.
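Why the Latin-1 default is safe: Latin-1 occupies the first 256 code
points of Unicode, so "converting" it to Unicode is just zero-extension
and nothing can be mangled. A minimal sketch (the function name is
invented, this is not kernel code):

```c
#include <stdint.h>

/* Latin-1 is the first 256 code points of Unicode, so conversion is
 * the identity mapping U+0000..U+00FF -- a byte passed through
 * untouched is already correct. */
uint16_t latin1_to_unicode(uint8_t c)
{
    return (uint16_t)c;
}
```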

>> UTF-8 is not quite as bad as many native multi-byte encodings,
>> but it is still really bad. It can be used to store Unicode
>> filenames on a hostile network filesystem. It is also a weak
>> kind of compression for systems that are 95% ASCII and 5%
>> of some mixed/large language(s). Other than that, forget it.
>
> What do you mean "still really bad"? I strongly disagree with that
> statement, I think it is the preferred form for interchange.

With UTF-8, repeated conversions are unavoidable and somewhat complex.
Most apps will _severely_ mishandle text on a UTF-8 system. At least
with raw 16-bit Unicode you know that newer apps will operate without
conversion overhead, and older apps can't split a character down the
middle the way they can with multi-byte UTF-8 sequences. At worst you
have a byte-order swap.
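For comparison, here is roughly the per-character work needed to turn
16-bit values into UTF-8 -- work that has to be repeated on every
conversion pass over the text. A sketch only, not kernel code:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one 16-bit Unicode value as UTF-8 (1 to 3 bytes).
 * Returns the number of bytes written to out[]. */
size_t utf8_encode(uint16_t u, uint8_t *out)
{
    if (u < 0x80) {                       /* ASCII: 1 byte */
        out[0] = (uint8_t)u;
        return 1;
    } else if (u < 0x800) {               /* 2-byte sequence */
        out[0] = 0xC0 | (u >> 6);
        out[1] = 0x80 | (u & 0x3F);
        return 2;
    } else {                              /* 3-byte sequence */
        out[0] = 0xE0 | (u >> 12);
        out[1] = 0x80 | ((u >> 6) & 0x3F);
        out[2] = 0x80 | (u & 0x3F);
        return 3;
    }
}
```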

>> Next, the kernel must translate filenames. I want to put your
>> KOI-8 floppy in my system and read it the right way as well
>> as I can. If I convert to full Unicode, I want to read every
>> filesystem I can find. This requires a mount option for every
>> filesystem with a poorly defined character set.
>
> This is exactly the wrong thing to do. We *DON'T* want this
> kind of crap in the system. If so, we're much better off
> standardizing on Unicode. Otherwise the kernel has to know
> about every bloody character set in existence -- this is
> completely utterly intolerable.

It is funny to see that from you, because I think you had something
to do with loadable translation tables for the console. Do you also
find that completely utterly intolerable? There are already several
reimplementations of it for filesystems. Wouldn't it be better if
they could share the same code and translation tables?

Also, this _is_ standardizing on Unicode. Unicode is the middle
layer, plus it gets exposed to Unicode filesystems and applications.
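A shared table mechanism of the sort suggested above might look like
this. The type name, function name, and the two sample KOI8-R entries
are illustrative only; a real table covers all 256 bytes:

```c
#include <stdint.h>

/* One 256-entry table per 8-bit character set, mapping each byte to
 * its Unicode value.  Console code and every filesystem could share
 * this instead of reimplementing it privately. */
typedef uint16_t charset_table[256];

/* Two real KOI8-R mappings, for illustration. */
static charset_table koi8r_sample = {
    [0xC1] = 0x0430,  /* CYRILLIC SMALL LETTER A  */
    [0xC2] = 0x0431,  /* CYRILLIC SMALL LETTER BE */
};

uint16_t byte_to_unicode(const charset_table t, uint8_t c)
{
    return t[c];
}
```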

>> That leaves only the kernel API. The standard way of fixing
>> an API will do quite well: alternate system calls for raw
>> 16-bit Unicode. Only the calls that take/return 8-bit text
>> need alternates. The old calls do _not_ get deprecated, at
>> least not much. They need to use plain 8-bit (not multi-byte)
>> text and remain that way for the next 30 years at least.
>> For the new API, pick the byte order with Java and vfat in mind.
>> The '/' and '\0' are safe: the kernel uses 16-bit versions.
>
> Great; you do know that Java and VFAT use opposite byte order, right?

I'd suspected so. That means Linux can pick either one or always
use native byte order.
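The worst case for raw 16-bit text is then one byte swap per
character, e.g. when reading little-endian VFAT names on a big-endian
host. A sketch of how cheap that is:

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit character -- the whole conversion
 * cost when the on-disk byte order differs from the host's. */
uint16_t swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}
```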

> This is the wrong thing to do. Use UTF-8 encoding as the
> multibyte set, and do conversion to wide characters if you
> want to.

Since the conversion is not cheap and UTF-8 breaks everything
anyway, we might as well do this the Right Way with 16-bit
characters all across the API. The old calls must remain
single-byte encoded for normal apps.

> The Asians are -- for good reason -- already screaming bloody
> murder over 16 bits; either we end up using an awful kluge
> like UTF-16, or we stick to 8-bit bytes and use UTF-8, which
> handles all of UCS-4 quite elegantly.

Normal everyday "characters" fit in 16 bits. Since there are
more characters every day, they can't all go into halfway
portable filenames anyway. This is why word processors and HTML
let you embed an image (<img src="foo.gif">) as needed.
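For reference, the UTF-16 "kluge" objected to above: a code point
beyond U+FFFF cannot be stored in one 16-bit character, so UTF-16
splits it into a surrogate pair, while UTF-8 instead grows to a
four-byte sequence. A sketch of the pair computation:

```c
#include <stdint.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair:
 * subtract 0x10000, then put the high 10 bits in 0xD800-0xDBFF
 * and the low 10 bits in 0xDC00-0xDFFF. */
void utf16_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    uint32_t v = cp - 0x10000;
    *hi = (uint16_t)(0xD800 | (v >> 10));
    *lo = (uint16_t)(0xDC00 | (v & 0x3FF));
}
```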