Re: Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Supportnon-BMP characters in UDF)

From: Jan Kara
Date: Thu May 17 2012 - 15:45:37 EST


On Thu 17-05-12 17:30:32, Vladimir 'Ï-coder/phcoder' Serbinenko wrote:
> On 17.05.2012 16:40, Jan Kara wrote:
> > On Thu 17-05-12 02:48:49, Vladimir 'Ï-coder/phcoder' Serbinenko wrote:
> >>
> >>> I've noticed another duplication in the UDF code: there
> >>> is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
> >>> actually: with -o utf8 and -o iocharset=utf8 which imply different
> >>> codepaths. Specific UTF-8 support is probably slightly faster by
> >>> avoiding calls and basically doing everything with shifts (or can be
> >>> made so with a small patch). Should I perhaps kill one of them? Is
> >>> iocharset!=utf8 still of any importance? I haven't seen it in ages.
> >>> Perhaps we could keep just the performant UTF-8 support and map
> >>> iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
> >>> has no users anyway so keeping it we're likely to keep bugs and code
> >>> duplication with no benefit.
> >>>
> >>
> >> Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
> >> (message from Sun, 15 Feb 2004 02:42:45 GMT).
> >> And I completely agree.
> >> If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
> >> extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)
> > Well, yes, utf8 is currently the only sane setting but that doesn't mean
> > someone isn't using (e.g. iso8859-2) for strange reasons...
>
>
> What would be the correct behaviour if we encounter the characters which
> can't be represented in the given charset? Currently the code replaces
> them with question marks but since this doesn't complete round trip
> successfully someone attempting to open or stat the file by name won't
> be able to. So these files become pretty much "ghosts" that you see but
> can't do anything with them.
Yeah. So maybe we can just pass the bytes encoding such characters
further? Sure the names would look awkward but at least they would be some
names to use. I don't say it's ideal but it's at least some sensible way...
But that's a separate question from our current discussion AFAICT. Also so
far noone has complained about the question marks either so if someone is
using iocharset, he probably knows what he is doing ;). So I don't think
fixing this is really important.

> Hiding them altogether would lead to
> situations when the disk appears empty but df shows that it's 100% full.
> While encodings like iso-8859-1 are relatively straightforward, some
> other (East Asian) encodings may produce '/' as part of another
> character and so confuse the kernel. Such encodings are also stateful
> and I'm pretty sure that current code bugs on them.
> I don't know if these quirks can be used to make a program load a file
> it wasn't intended to and whether it's of any security concern.
> I'm aware of bash security problems with such characters when part of
> Chinese character is interpreted as backtick.
> I don't think that these problems can create a security hole on kernel
> side, they can be used to confuse userspace but I doubt it's anything
> exploitable but it's something I'd be doubtful about.

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/