Re: [PATCH] console UTF-8 fixes

From: Jan Engelhardt
Date: Wed Apr 11 2007 - 15:03:44 EST

Next message: David Howells: "[PATCH 0/8] AFS: Add security support and fix bugs"
Previous message: Mathieu Desnoyers: "[PATCH] Linux Kernel Markers documentation fix typo and use ARRAY_SIZE"
In reply to: Roman Zippel: "Re: [PATCH] console UTF-8 fixes"
Next in thread: Egmont Koblinger: "Re: [PATCH] console UTF-8 fixes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Apr 11 2007 20:28, Egmont Koblinger wrote:

>I am sending a reworked version of the patch.
>
>Removed from the first version:
> - any sign of '.' as substitute glyph
> - ignoring of zero-width characters: they are no longer skipped (except
> for a few zero-width spaces that the current kernel ignores too).
> However, I kept the code organized and commented so that someone can
> get the other behavior very easily (by removing a pair of comment signs).
>
>Kept features, fixes:
> - lots of UTF-8 decoder fixes. Emit one U+FFFD for every standalone
> continuation byte and for every incomplete sequence, as Markus Kuhn
> recommends. Reject overlong sequences too.
> - D800..DFFF and FFFE..FFFF are substituted by FFFD too, since these are
> not valid Unicode code points.
> - no "random" replacement glyph (e.g. u with double acute instead of
> u with circumflex) in UTF-8 mode
> - if U+FFFD is not found in the font, the fallback replacement '?' (ascii
> question mark) is printed with inverse color attributes
> - U+200A was ignored so far as a zero-width space character. I think that
> was a mistake; it's not zero-width.
> - print an extra space for double-wide characters for the cursor to stand
> at the right place. Yet again the code is organized so that it is very
> easy to change to jump only one character cell, should someone prefer
> that behavior (which I still see no good reason for).
>
>Signed-off-by: Egmont Koblinger <egmont@xxxxxxxxxxx>
>
>@@ -1934,6 +1943,99 @@
> char con_buf[CON_BUF_SIZE];
> DECLARE_MUTEX(con_buf_sem);
>
>+/* is_{zero,double}_width() are based on the wcwidth() implementation by
>+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
>+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>+ */
>+struct interval {
>+ int first;
>+ int last;
>+};

CodingStyle? Also, uint16_t instead of int? (That only works if the entries
above 0xFFFF are dropped, see below.)

>+static int is_zero_width(long ucs)
>+{
>+ static const struct interval zero_width[] = {
>+ { 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
[...]
>+ { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
>+ { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x1D167, 0x1D169 },
>+ { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
>+ { 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
>+ };

Since code points above 0xFFFF are unsupported, couldn't these entries be dropped?

>+static int is_double_width(long ucs)
>+{
>+ static const struct interval double_width[] = {
>+ { 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
>+ { 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
>+ { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
>+ { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
>+ };

Similarly here: the last two ranges are above 0xFFFF.
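
To illustrate what the two comments above would amount to, here is a rough
standalone sketch, not the patch's code. It assumes the tables are limited to
the BMP so that 16-bit bounds suffice. The names interval16, in_table,
zero_width_bmp and double_width_bmp are made up for this example, and the
zero-width table holds only the ranges quoted above (the elided "[...]" part
is left out), so it is incomplete on purpose.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit variant of the patch's struct interval. */
struct interval16 {
	uint16_t first;
	uint16_t last;
};

/* Binary search over a sorted, non-overlapping interval table. */
static int in_table(uint32_t ucs, const struct interval16 *table, size_t n)
{
	size_t lo = 0, hi = n;

	if (ucs > 0xFFFF)	/* outside the BMP: cannot be in a 16-bit table */
		return 0;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ucs > table[mid].last)
			lo = mid + 1;
		else if (ucs < table[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* Only the ranges quoted in this mail; the real table has more entries. */
static const struct interval16 zero_width_bmp[] = {
	{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
	{ 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
	{ 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB },
};

/* The quoted double-width ranges, minus the two above 0xFFFF. */
static const struct interval16 double_width_bmp[] = {
	{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
	{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
	{ 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
};

#define TABLE_LEN(t) (sizeof(t) / sizeof((t)[0]))

static int is_zero_width(uint32_t ucs)
{
	return in_table(ucs, zero_width_bmp, TABLE_LEN(zero_width_bmp));
}

static int is_double_width(uint32_t ucs)
{
	return in_table(ucs, double_width_bmp, TABLE_LEN(double_width_bmp));
}
```

With 16-bit bounds each entry takes 4 bytes instead of 8 (assuming a 32-bit
int), and anything above 0xFFFF is simply reported as neither zero-width nor
double-width.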

>@@ -1950,6 +2052,10 @@
> unsigned int currcons;
> unsigned long draw_from = 0, draw_to = 0;
> struct vc_data *vc;
>+ unsigned char vc_attr;
>+ int rescan;
unsigned int rescan:1;
>+ int inverse;
unsigned int inverse:1;
>+ int width;
unsigned int width; or even uint8_t.
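
One caveat with the :1 form: C only allows bit-fields as members of a struct
or union, so as plain locals these would have to be grouped. A minimal sketch
of that, where the struct name write_flags and the helper write_flags_reset
are made up for illustration and are not part of the patch:

```c
#include <stdint.h>

/* Hypothetical grouping; the patch declares these as plain int locals. */
struct write_flags {
	unsigned int rescan:1;	/* re-process the last byte */
	unsigned int inverse:1;	/* draw with inverse attributes */
	uint8_t width;		/* 1 or 2 character cells */
};

/* Per-character defaults, matching the resets in the quoted hunk below. */
static struct write_flags write_flags_reset(void)
{
	struct write_flags f = { .rescan = 0, .inverse = 0, .width = 1 };

	return f;
}
```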

> u16 himask, charmask;
> const unsigned char *orig_buf = NULL;
> int orig_count;

>@@ -2012,51 +2118,81 @@
> buf++;
> n++;
> count--;
>+ rescan = 0;
>+ inverse = 0;
>+ width = 1;
>
> /* Do no translation at all in control states */
> if (vc->vc_state != ESnormal) {
> tc = c;
> } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
>- /* Combine UTF-8 into Unicode */
>- /* Malformed sequences as sequences of replacement glyphs */
>+ /* Combine UTF-8 into Unicode in vc_utf_char */
>+ /* vc_utf_count is the number of continuation bytes still expected to arrive */
>+ /* vc_npar is the number of continuation bytes arrived so far */
> rescan_last_byte:
>- if(c > 0x7f) {
>+ if ((c & 0xc0) == 0x80) {
>+ /* Continuation byte received */
>+ static const int utf8_length_changes[] = { 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff };

I would not mind unsigned.
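
For reference, the decoder behavior described in the changelog can be
sketched as a standalone function. This is only an illustration under my own
assumptions, not the code from the patch: it works on a whole buffer instead
of one byte at a time, and the names utf8_decode, sanitize and
utf8_max_for_len are invented here. The table holds the same values as the
quoted utf8_length_changes, but unsigned.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Largest code point encodable in 1..6 bytes. */
static const uint32_t utf8_max_for_len[] = {
	0x0000007f, 0x000007ff, 0x0000ffff,
	0x001fffff, 0x03ffffff, 0x7fffffff
};

/* Map overlong forms, surrogates and FFFE/FFFF to U+FFFD. */
static uint32_t sanitize(uint32_t cp, int len)
{
	if (len > 1 && cp <= utf8_max_for_len[len - 2])
		return REPLACEMENT;	/* would have fit in a shorter form */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return REPLACEMENT;
	if (cp == 0xFFFE || cp == 0xFFFF)
		return REPLACEMENT;
	return cp;	/* values above 0x10FFFF pass through in this sketch */
}

/* Decode in[0..n) into out (room for at least n entries); returns count. */
static size_t utf8_decode(const unsigned char *in, size_t n, uint32_t *out)
{
	size_t i = 0, k = 0;

	while (i < n) {
		unsigned char c = in[i++];
		uint32_t cp;
		int len, got;

		if (c < 0x80) {
			out[k++] = c;
			continue;
		}
		if ((c & 0xc0) == 0x80) {	/* standalone continuation byte */
			out[k++] = REPLACEMENT;
			continue;
		}
		if ((c & 0xe0) == 0xc0) { len = 2; cp = c & 0x1f; }
		else if ((c & 0xf0) == 0xe0) { len = 3; cp = c & 0x0f; }
		else if ((c & 0xf8) == 0xf0) { len = 4; cp = c & 0x07; }
		else if ((c & 0xfc) == 0xf8) { len = 5; cp = c & 0x03; }
		else if ((c & 0xfe) == 0xfc) { len = 6; cp = c & 0x01; }
		else {				/* 0xfe and 0xff never occur */
			out[k++] = REPLACEMENT;
			continue;
		}

		/* Collect continuation bytes; stop at the first other byte,
		 * which is then rescanned as the start of a new sequence. */
		for (got = 1; got < len && i < n && (in[i] & 0xc0) == 0x80; got++)
			cp = (cp << 6) | (in[i++] & 0x3f);

		/* One U+FFFD for the whole incomplete sequence. */
		out[k++] = (got == len) ? sanitize(cp, len) : REPLACEMENT;
	}
	return k;
}
```

With an unsigned table the overlong check is a plain unsigned comparison, and
the 0x7fffffff entry raises no sign questions.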

Jan
