Re: [PATCH] console UTF-8 fixes

From: H. Peter Anvin
Date: Fri Apr 06 2007 - 15:43:27 EST


Egmont Koblinger wrote:

- If a certain (otherwise valid UTF-8) character is not found in the glyph
table, the current code does one of these two (depending on other
circumstances):

- Either it displays the replacement character U+FFFD, falling back to a
simple question mark. Note that the Unicode replacement character U+FFFD
is to be used for invalid sequences. However, it shouldn't necessarily
be used when replacing a valid but undisplayable character. Think of
Pango for example that renders these as four hex digits inside a square.
To be able to visually distinguish between illegal sequences and legal
but undisplayable characters, I think U+FFFD or the question mark are
bad choices. In fact, any symbol that may normally occur in the text is
a bad choice if it is displayed simply. Hence I chose to display an
inverted dot.


I strongly disagree. First of all, you're changing the semantics of a 13-year-old API. The semantics of the Linux console is that by specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have specified the fallback glyph.

What's worse, you've hard-coded the uses of specific visual representations. That is completely unacceptable.

- Another possible thing the current code may do (for latin1-compatible
characters) is to simply display the glyph loaded in that position.
Suppose I have loaded a latin2 font. In latin2, 0xFB is a "u with
double accent". An application prints U+00FB, which is a "u with
circumflex". Since this glyph is not present in latin2, it cannot be
printed with the current font. Still, the current code falls back to
printing the glyph from the 0xFB position of the glyph table. Hence my
app asked to print "u with circumflex" but a "u with double accent"
appears on the screen. This is totally contrary to the goals of Unicode
and shouldn't ever happen.

When does that happen? That is clearly a bug.

- The replacement character for invalid UTF-8 sequences is U+FFFD, falling
back to a question mark. I've changed the fallback version to an inverted
question mark. This way it's more similar to the common glyph of U+FFFD,
and it's more obvious to the user that it's not a literal question mark
but rather some erroneous situation.

Brilliant. You've picked a fallback glyph which is unlikely to exist in all fonts. The whole point of falling back to ? is that it's an ASCII character, which means that if the font designer failed to designate a fallback glyph -- which is an error!!! -- there is at least some hope of conveying the error back to the user.
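
To make the fallback order concrete, here is a minimal sketch in C. It is not the kernel's actual lookup code: struct glyph_map, lookup() and glyph_for() are hypothetical names, and a linear table stands in for the real unicode map. The point is only the order of the attempts: the requested character, then whatever the font designated as U+FFFD, then ASCII '?'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font table entry: one Unicode value mapped to one glyph slot. */
struct glyph_map {
    uint32_t ucs;
    int glyph;
};

/* Return the glyph slot for ucs, or -1 if the font has no entry for it. */
static int lookup(const struct glyph_map *map, size_t n, uint32_t ucs)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (map[i].ucs == ucs)
            return map[i].glyph;
    return -1;
}

/*
 * Fallback chain: the character itself, then the font's designated
 * U+FFFD glyph, then ASCII '?'. Returns -1 only if all three are missing.
 */
static int glyph_for(const struct glyph_map *map, size_t n, uint32_t ucs)
{
    int g = lookup(map, n, ucs);

    if (g < 0)
        g = lookup(map, n, 0xFFFD);
    if (g < 0)
        g = lookup(map, n, '?');
    return g;
}
```

With this order, a font that maps U+FFFD keeps full control over what the user sees, and a font that forgot to map it still shows something every font is expected to contain.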

- Overlong sequences are not caught currently, they're displayed as if these
were valid representations. This may even have security impacts.

- Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
currently displayed as some "random" glyphs rather than the replacement
character.

- Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
but rather cause the subsequent valid character to be displayed multiple
times(!).

These are valid issues.
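
The three cases above (overlong sequences, lone continuation bytes, incomplete sequences) can be illustrated with a small standalone decoder. This is a sketch under stated assumptions, not the patch and not the console code: utf8_decode() is a hypothetical helper, it assumes len is at least 1, and it reports every malformed input as U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_REPLACEMENT 0xFFFDu

/*
 * Decode one code point from buf[0..len), with len >= 1 (assumption).
 * *used receives the number of bytes consumed, always at least 1.
 * Malformed input yields U+FFFD.
 */
static uint32_t utf8_decode(const unsigned char *buf, size_t len, size_t *used)
{
    /* Smallest code point that legitimately needs a 2-, 3- or 4-byte form. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned char b = buf[0];
    size_t need, i;
    uint32_t cp;

    *used = 1;
    if (b < 0x80)
        return b;                       /* plain ASCII */
    if (b < 0xC0)
        return UCS_REPLACEMENT;         /* lone continuation byte */
    if (b < 0xE0) {
        need = 2; cp = b & 0x1F;
    } else if (b < 0xF0) {
        need = 3; cp = b & 0x0F;
    } else if (b < 0xF8) {
        need = 4; cp = b & 0x07;
    } else {
        return UCS_REPLACEMENT;         /* 0xF8..0xFF never start a sequence */
    }

    for (i = 1; i < need; i++) {
        if (i >= len || (buf[i] & 0xC0) != 0x80) {
            /* Incomplete: consume only what was seen, so the next
             * byte is decoded on its own and shown exactly once. */
            *used = i;
            return UCS_REPLACEMENT;
        }
        cp = (cp << 6) | (buf[i] & 0x3F);
    }
    *used = need;

    if (cp < min_cp[need])
        return UCS_REPLACEMENT;         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return UCS_REPLACEMENT;         /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return UCS_REPLACEMENT;         /* beyond the Unicode range */
    return cp;
}
```

For example, the bytes E2 82 41 decode as U+FFFD (two bytes consumed) followed by 'A' (one byte), so the truncated sequence is reported once and the following character is not repeated.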

- There's no concept of double-width characters. It's way beyond the scope
of my patch to try to display them, but at least I think it's important
for the cursor to jump two positions when printing such characters, since
this is what applications (such as text editors) expect. If the cursor
didn't jump two positions, applications would suffer from displaying and
refreshing problems, and editing some English letters that are preceded by
some CJK characters in the same line would become a nightmare. With my patch an
inverted dot followed by an inverted space is displayed for double-width
characters so it's quite easy to see that they are tied together.

To be able to do CJK you need something like Kon anyway. This feels like bloat.

- There's no concept of zero-width characters (such as combining accents)
either. Yet again it's beyond the scope of my patch to properly handle
them. Instead of the current behavior (write a replacement character) I
just ignore them so that full-screen applications can keep track of the
cursor position correctly.

There is a concept of combining sequences. For anything else, I suspect it's better to let the user know that something bad is happening.
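
For reference, the cursor bookkeeping under discussion in the last two items amounts to assigning each character a width of 0, 1 or 2 cells. The sketch below is purely illustrative: cell_width() and columns_used() are hypothetical names, and only three well-known ranges are listed (combining diacritical marks, CJK unified ideographs, Hangul syllables), where a real table would be far larger.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative width classification; the ranges are deliberately incomplete. */
static int cell_width(uint32_t ucs)
{
    if (ucs >= 0x0300 && ucs <= 0x036F)
        return 0;   /* Combining Diacritical Marks */
    if ((ucs >= 0x4E00 && ucs <= 0x9FFF) ||
        (ucs >= 0xAC00 && ucs <= 0xD7A3))
        return 2;   /* CJK Unified Ideographs, Hangul Syllables */
    return 1;       /* everything else is treated as single-width here */
}

/* Number of screen columns the cursor moves over for n code points. */
static size_t columns_used(const uint32_t *s, size_t n)
{
    size_t i, cols = 0;

    for (i = 0; i < n; i++)
        cols += (size_t)cell_width(s[i]);
    return cols;
}
```

The size of a complete table is part of what the "bloat" objection above refers to.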

- I believe (at least I do hope) that my code is cleaner, more
straightforward, easier to understand, and is slightly better documented
than the current version. The current code doesn't separate UTF-8 decoding
and glyph displaying parts. I clearly separated them. First I perform
UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
the width of the resulting character, change it to U+FFFD if it's
unprintable (e.g. a UTF-16 surrogate), and finally comes the part that
does its best in displaying the character on the screen.

I hope you like it. :)

Please see above comments.

-hpa