Re: Unicode, etc. solution

H. Peter Anvin (hpa@transmeta.com)
27 Aug 1997 09:51:33 GMT


Followup to: <199708270829.EAA27601@cad2.cs.uml.edu>
By author: "Albert D. Cahalan" <acahalan@cs.uml.edu>
In newsgroup: linux.dev.kernel
>
> The reason is simple: If the kernel assumes you use Latin-1
> for both the filesystem and console, it won't mangle anything.
> This is not exactly the same as ignoring the characterset!
> It is a nice hack for 1-language 8-bit systems.
>

Nope, it's not. It's a much better idea to make sure the right tables
(KOI-8 -> Unicode -> font) are loaded. The first of these translations
still needs work.
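
A minimal sketch of that two-step chain, to show what the tables buy
you. It assumes Python's stock koi8_r codec stands in for the KOI-8 ->
Unicode table; the glyph_slot table is purely hypothetical and not
any real console font layout.

```python
# Sketch only: glyph_slot below is a made-up font table for illustration.
text = "Привет"
raw = text.encode("koi8_r")        # the bytes as stored in a KOI8-R filename

# Step 1: KOI-8 -> Unicode. Each byte maps to exactly one code point.
as_unicode = raw.decode("koi8_r")

# For comparison: the same bytes under the "assume Latin-1" reading.
# Nothing is lost, but the code points are not the Cyrillic ones.
as_latin1 = raw.decode("latin-1")

# Step 2: Unicode -> font. Hypothetical font with slots assigned in order.
glyph_slot = {ch: slot for slot, ch in enumerate(sorted(set(as_unicode)))}
glyphs = [glyph_slot[ch] for ch in as_unicode]

print([hex(ord(ch)) for ch in as_unicode])
print([hex(ord(ch)) for ch in as_latin1])
print(glyphs)
```

With both tables in place the bytes carry a known identity; with the
Latin-1 assumption they only look right if the font happens to be a
KOI-8 font as well.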

> UTF-8 is not quite as bad as many native multi-byte encodings,
> but it is still really bad. It can be used to store Unicode
> filenames on a hostile network filesystem. It is also a weak
> kind of compression for systems that are 95% ASCII and 5%
> of some mixed/large language(s). Other than that, forget it.

What do you mean "still really bad"? I strongly disagree with that
statement; I think it is the preferred form for interchange.

> Next, the kernel must translate filenames. I want to put your
> KOI-8 floppy in my system and read it the right way as well
> as I can. If I convert to full Unicode, I want to read every
> filesystem I can find. This requires a mount option for every
> filesystem with a poorly defined character set.

This is exactly the wrong thing to do. We *DON'T* want this kind of
crap in the system. If anything, we're much better off standardizing
on Unicode. Otherwise the kernel has to know about every bloody
character set in existence -- this is completely, utterly intolerable.

> That leaves only the kernel API. The standard way of fixing
> an API will do quite well: alternate system calls for raw
> 16-bit Unicode. Only the calls that take/return 8-bit text
> need alternates. The old calls do _not_ get deprecated, at
> least not much. They need to use plain 8-bit (not multi-byte)
> text and remain that way for the next 30 years at least.
> For the new API, pick the byte order with Java and vfat in mind.
> The '/' and '\0' are safe: the kernel uses 16-bit versions.

Great; you do know that Java and VFAT use opposite byte order, right?
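
A quick illustration of the byte-order problem, as a sketch. It
assumes Java's serialized 16-bit form is big-endian and VFAT long
filenames are little-endian, and uses Python's utf-16-be / utf-16-le
codecs to stand in for the two.

```python
name = "a/b"

java_style = name.encode("utf-16-be")   # big-endian 16-bit units
vfat_style = name.encode("utf-16-le")   # little-endian 16-bit units

print(java_style)   # b'\x00a\x00/\x00b'
print(vfat_style)   # b'a\x00/\x00b\x00'

# Any code that still looks at bytes sees NULs everywhere ...
print(b"\x00" in java_style)            # True

# ... and a stray '/' byte inside a character that is not a slash:
# U+012F is two bytes, 0x01 0x2F, and 0x2F is ASCII '/'.
print("\u012f".encode("utf-16-be"))     # b'\x01/'
```

So "pick the byte order with both in mind" means picking one and
swapping for the other, and every byte-oriented code path has to be
kept away from the 16-bit strings.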

This is the wrong thing to do. Use UTF-8 encoding as the multibyte
set, and do conversion to wide characters if you want to. The Asians
are -- for good reason -- already screaming bloody murder over 16
bits; either we end up using an awful kluge like UTF-16, or we stick
to 8-bit bytes and use UTF-8, which handles all of UCS-4 quite
elegantly.
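
To make "handles all of UCS-4" concrete, here is a sketch of the
UTF-8 scheme with sequences of up to six bytes, which covers 31 bits.
The function name utf8_encode is made up for illustration. Python's
built-in utf-8 codec only goes up to U+10FFFF, so the last value
printed below cannot be cross-checked against it.

```python
def utf8_encode(cp):
    """Encode one code point (0 .. 0x7FFFFFFF) as a UTF-8 byte sequence."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])              # ASCII: one byte, unchanged
    # (sequence length, first value that no longer fits, lead-byte marker)
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in forms:
        if cp < limit:
            tail = []
            for _ in range(nbytes - 1):
                tail.append(0x80 | (cp & 0x3F))   # continuation byte, 6 bits
                cp >>= 6
            return bytes([lead | cp] + tail[::-1])
    raise ValueError("code point does not fit in 31 bits")


print(utf8_encode(0x41).hex())          # 41
print(utf8_encode(0x20AC).hex())        # e282ac
print(utf8_encode(0x10000).hex())       # f0908080
print(utf8_encode(0x7FFFFFFF).hex())    # fdbfbfbfbfbf
```

The same rule extends from 7 bits to 31 bits without a second
mechanism, whereas UTF-16 needs surrogate pairs as soon as you leave
the first 16-bit plane.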

Backward compatibility with 8 bits and forward compatibility with more
than 16 bits (planes 1 and 2 in ISO 10646 are already being defined)
are what lead me to say that UTF-8 is the way to go.
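
The backward-compatibility half can be sketched like this. The paths
are invented examples; the point is only that byte-oriented code which
splits on '/' and stops at NUL keeps working on UTF-8 names untouched.

```python
# Plain ASCII names are already valid UTF-8, byte for byte.
ascii_path = b"/usr/src/linux/README"
print(ascii_path.decode("utf-8").encode("utf-8") == ascii_path)   # True

# A hypothetical path with a non-ASCII component, stored as UTF-8.
path = "/home/hpa/файл.txt".encode("utf-8")

# Old-style byte handling: split on the '/' byte, no decoding involved.
components = path.split(b"/")
print(len(components))                   # 4
print(components[-1].decode("utf-8"))    # файл.txt

# Every byte of a non-ASCII character is >= 0x80, so it can never be
# mistaken for '/' (0x2F) or NUL (0x00).
print(all(b >= 0x80 for b in "файл".encode("utf-8")))             # True
print(b"\x00" in path)                   # False
```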

-hpa

-- 
    PGP: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
    See http://www.zytor.com/~hpa/ for web page and full PGP public key
Always looking for a few good BOsFH.  **  Linux - the OS of global cooperation
        I am Baha'i -- ask me about it or see http://www.bahai.org/