Re: NLS: utf8 conversions

From: Clemens Ladisch
Date: Mon Apr 27 2009 - 04:09:54 EST


Alan Stern wrote:
> Although nobody seems to have made a big deal about it, the conversions
> between utf8 and utf16 done by fs/nls/nls_base.c are wrong in a couple
> of important respects:
>
> They don't handle Unicode code points larger than U+FFFF.
>
> They don't detect invalid values, in particular, surrogate
> code points.
>
> The problems stem from the fact that the characters at issue can't be
> represented by a single 16-bit wchar_t. But that's no excuse for
> performing an incorrect conversion to or from utf16.
>
> Are there any definite thoughts on how this should be handled? I don't
> see any way for the single-character conversion routines (utf8_mbtowc
> and utf8_wctomb) to come to grips with these issues, except perhaps for
> returning an error when a character would be invalid or too big to fit
> in 16 bits.
>
> The string-oriented routines (utf8_mbstowcs and utf8_wcstombs) could be
> adapted to deal with these issues properly.
>
> Any comments or suggestions for other approaches?

The single-character utf8_* routines in nls_base.c are just special
cases of the NLS API for the UTF-8 encoding. The string-oriented
routines, as far as I can see, are actually used only for conversions
between UTF-8 and UTF-16, not wchar_t, so they should probably be
renamed.

As for the NLS API itself: If we want to be able to handle code points
larger than U+FFFF, the obvious answer is to make wchar_t a 32-bit type.
This should not be too large a problem because the FS NLS API is
designed so that wchar_t is only used for temporary values, i.e.,
characters are converted from some on-disk encoding to wchar_t, then
from wchar_t to some I/O encoding (usually UTF-8); and the conversions
are done one code point at a time.
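
To illustrate what a single-code-point UTF-8 decoder could look like once
the intermediate type is 32 bits wide, here is a rough userspace sketch.
The names ucs4_t and sketch_utf8_to_ucs4() are made up for this example
and are not existing kernel interfaces; the point is only that a 32-bit
result leaves room for code points above U+FFFF and that surrogates and
overlong forms can be rejected in the same place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit code point type; not an existing kernel type. */
typedef uint32_t ucs4_t;

/*
 * Decode one code point from the UTF-8 bytes at s (at most len bytes).
 * Returns the number of bytes consumed, or -1 if the sequence is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF.
 */
int sketch_utf8_to_ucs4(const unsigned char *s, size_t len, ucs4_t *out)
{
	ucs4_t c, min;
	int n, i;

	if (len == 0)
		return -1;

	if (s[0] < 0x80) {			/* plain ASCII */
		*out = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {	/* 2-byte sequence */
		n = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {	/* 3-byte sequence */
		n = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {	/* 4-byte sequence */
		n = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return -1;			/* stray continuation or invalid lead byte */
	}

	if (len < (size_t)n)
		return -1;			/* truncated input */

	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return -1;		/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	if (c < min)
		return -1;			/* overlong encoding */
	if (c >= 0xD800 && c <= 0xDFFF)
		return -1;			/* surrogate code point */
	if (c > 0x10FFFF)
		return -1;			/* beyond the Unicode range */

	*out = c;
	return n;
}

/* Small self-check: U+1F600 needs four bytes and more than 16 bits. */
void sketch_utf8_selfcheck(void)
{
	static const unsigned char grin[] = { 0xF0, 0x9F, 0x98, 0x80 };
	ucs4_t cp = 0;

	assert(sketch_utf8_to_ucs4(grin, sizeof(grin), &cp) == 4);
	assert(cp == 0x1F600);
}
```

With a 16-bit wchar_t the only honest options for the four-byte case
would be to return an error or to truncate; with a 32-bit type the value
simply fits.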

The file systems that use some form of UTF-16 (VFAT, NTFS, CIFS, UDF,
etc.) use the NLS API in a different way: they treat the individual
UTF-16 code units as wchar_t values and do only the conversion from
wchar_t to the I/O encoding. Here we'd need to introduce an additional
conversion step between UTF-16 and wchar_t, i.e., treat UTF-16 like any
other multibyte encoding.
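
The extra UTF-16 step might look roughly like the following userspace
sketch. The function name sketch_utf16_to_ucs4() is invented for this
example, and the code assumes the 16-bit units have already been
converted to host byte order; a real implementation would have to deal
with the on-disk endianness of each file system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from the UTF-16 code units at s (at most len
 * units, host byte order assumed).  Returns the number of units
 * consumed (1 or 2), or -1 for an unpaired or misordered surrogate.
 */
int sketch_utf16_to_ucs4(const uint16_t *s, size_t len, uint32_t *out)
{
	if (len == 0)
		return -1;

	/* Anything outside the surrogate range stands for itself. */
	if (s[0] < 0xD800 || s[0] > 0xDFFF) {
		*out = s[0];
		return 1;
	}

	/* A low surrogate cannot start a pair. */
	if (s[0] >= 0xDC00)
		return -1;

	/* A high surrogate must be followed by a low surrogate. */
	if (len < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
		return -1;

	/* Combine 10 bits from each half and add the 0x10000 offset. */
	*out = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) |
			  (uint32_t)(s[1] - 0xDC00));
	return 2;
}

/* Small self-check: the pair D83D DE00 encodes U+1F600. */
void sketch_utf16_selfcheck(void)
{
	static const uint16_t grin[] = { 0xD83D, 0xDE00 };
	uint32_t cp = 0;

	assert(sketch_utf16_to_ucs4(grin, 2, &cp) == 2);
	assert(cp == 0x1F600);
}
```

The resulting code point would then be passed to the usual
wchar_t-to-I/O-encoding conversion, exactly as for any other on-disk
encoding.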


Best regards,
Clemens