Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]

From: Jann Horn
Date: Tue Mar 03 2020 - 11:58:28 EST


On Tue, Mar 3, 2020 at 5:51 PM Greg Kroah-Hartman
<gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> > On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > >
> > > > > > If buffer is too small to fit the whole file, return error.
> > > > >
> > > > > Why? What's wrong with just returning the bytes asked for? If someone
> > > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > > that to them, right?
> > > >
> > > > I think we need to signal in some way to the caller that the result
> > > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > > caller might be surprised.
> > >
> > > But that's not the way a "normal" read works. Short reads are fine, if
> > > the file isn't big enough. That's how char device nodes work all the
> > > time as well, and this kind of is like that, or some kind of "stream" to
> > > read from.
> > >
> > > If you think the file is bigger, then you, as the caller, can just pass
> > > in a bigger buffer if you want to (i.e. you can stat the thing and
> > > determine the size beforehand.)
> > >
> > > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > > buffer. That way userspace "knows" it will always read all of the data
> > > it can from the file, we don't have to do any seeking or determining
> > > real file size, or anything else like that.
> > >
> > > We return the number of bytes read as well, so we "know" if we did a
> > > short read, and also, you could imply, if the number of bytes read are
> > > the exact same as the number of bytes of the buffer, maybe the file is
> > > either that exact size, or bigger.
> > >
> > > This should be "simple", let's not make it complex if we can help it :)
> > >
> > > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > > return error (may need to loop?).
> > > > >
> > > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > > "size". So I don't want to loop at all here, one-shot, that's all you
> > > > > get :)
> > > >
> > > > Hmm. I understand the no-size thing. But looping until EOF (i.e.
> > > > until read return zero) might be a good idea regardless, because short
> > > > reads are allowed.
> > >
> > > If you want to loop, then do a userspace open/read-loop/close cycle.
> > > That's not what this syscall should be for.
> > >
> > > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()? :)
> >
> > So how is this supposed to work in e.g. the following case?
[...]
> > int maps = open("/proc/self/maps", O_RDONLY);
> > static char buf[0x100000];
> > int res;
> > do {
> > res = read(maps, buf, sizeof(buf));
> > } while (res > 0);
> > }
[...]
> >
> > The kernel is randomly returning short reads *with different lengths*
> > that are vaguely around PAGE_SIZE, no matter how big the buffer
> > supplied by userspace is. And while repeated read() calls will return
> > consistent state thanks to the seqfile magic, repeated readfile()
> > calls will probably return garbage with half-complete lines.
>
> Ah crap, I forgot about seqfile, I was only considering the "simple"
> cases that sysfs provides.
>
> Ok, Miklos, you were totally right, I'll loop and read until the end of
> file or buffer, which ever comes first.

I wonder what we should do when one of the later reads returns an
error code. As in, we start the first read, get a short read (maybe
because a signal arrived), try a second read, get -EINTR. Do we just
return the error code? That'd probably work fine for most usecases -
e.g. if "top" is reading stuff from procfs, and that gets interrupted
by SIGWINCH or so, it doesn't matter that we've already started the
first read; the only thing "top" really needs to know is that the read
was a short read and it has to retry.