Re: Migration of kernel interfaces to seq_files breaks pread() consumers

From: Andrew Morton
Date: Sat Jan 24 2009 - 21:20:27 EST


On Fri, 16 Jan 2009 23:51:35 -0800 (PST) Paul Turner <pjt@xxxxxxxxxx> wrote:

>
> (Specifically) Several interfaces under /proc have been migrated to use
> seq_files. This was previously observed to be a problem with VMware's
> reading of /proc/uptime. We're now running into the same problem on
> /proc/<pid>/stat; we have many consumers performing preads on this
> interface which break under new kernels.
>
> Reverting these migrations presents other problems and doesn't scale with
> everyone's pet dependencies on an ABI that's been broken :(
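
For illustration, a minimal userspace sketch of the failure mode described above: on kernels where seq_open() clears FMODE_PREAD, pread() on such a file fails with ESPIPE. The helper name read_at() is hypothetical, and the lseek() + read() fallback is an assumption about how a consumer could cope, not something proposed in this thread. Note that the fallback moves the file offset, which a real pread() would not.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to count bytes at offset pos.
 * Tries pread() first; if the file refuses positional reads (ESPIPE),
 * falls back to lseek() + read(). The fallback changes the
 * descriptor's file offset, unlike a real pread().
 */
ssize_t read_at(int fd, void *buf, size_t count, off_t pos)
{
	ssize_t n = pread(fd, buf, count, pos);

	/* Success, or a failure unrelated to missing pread support. */
	if (n >= 0 || errno != ESPIPE)
		return n;

	if (lseek(fd, pos, SEEK_SET) == (off_t)-1)
		return -1;
	return read(fd, buf, count);
}
```

A consumer of /proc/<pid>/stat could call read_at(fd, buf, sizeof(buf), 0) where it currently calls pread(fd, buf, sizeof(buf), 0), provided nothing else relies on that descriptor's offset.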

We changed userspace-visible behaviour and broke real applications.
This is a serious matter. So serious in fact that your report has
languished without reply for a week.

Reverting those changes until we have a suitable reimplementation which
doesn't bust userspace is 100% justifiable.

In which kernel versions is this regression present?

What would a revert look like? Big and ugly or small and simple? Do
the original commits (which were they?) still revert OK?

> Part of the problem in implementing pread in seq_files is that we don't
> know whether the read was issued by pread(2) or read(2). It's not
> nice to shoehorn this information down the stack. I've attached a
> skeleton patch which shows one way we could push it up (although something
> like a second f_pos would be necessary to make it maintain pread
> semantics against reads).
>
> One advantage of this style of approach is that it doesn't break on
> partial record reads. But it's a little gross at the same time.
>

Yes, that is a bit gross.

Does this patch actually 100% solve the problem, or is it a precursor
to some other fix or what? It's hard to comment sensibly if it's a
partial thing with no sign of how it will be used.

> diff --git a/fs/read_write.c b/fs/read_write.c
> index 2fc2980..744094a 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -407,6 +407,16 @@ asmlinkage ssize_t sys_pread64(unsigned int fd, char __user *buf,
>  		ret = -ESPIPE;
>  		if (file->f_mode & FMODE_PREAD)
>  			ret = vfs_read(file, buf, count, &pos);
> +		else if (file->f_mode & FMODE_SEQ_FILE) {
> +			/*
> +			 * We break the pread semantic and actually make it
> +			 * seek, this prevents inconsistent record reads across
> +			 * boundaries.
> +			 */
> +			vfs_llseek(file, pos, SEEK_SET);
> +			ret = vfs_read(file, buf, count, &pos);
> +			file_pos_write(file, pos);
> +		}

Well yes, that's a userspace-visible wrong change too.
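
To make the objection concrete: POSIX specifies that pread() reads at the given offset without changing the file offset, so a seek-then-read implementation is observably different from userspace. A minimal sketch of a check for that property, run against a temporary regular file (the function name and the /tmp path template are illustrative assumptions, not part of the thread):

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if pread() left the file offset untouched and returned the
 * expected bytes, 0 if not, -1 on setup failure.
 */
int pread_preserves_offset(void)
{
	char path[] = "/tmp/pread_demo_XXXXXX";	/* illustrative template */
	char buf[4];
	off_t before, after;
	int result = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);				/* file lives until close() */

	if (write(fd, "0123456789", 10) != 10)
		goto out;
	before = lseek(fd, 2, SEEK_SET);	/* park the offset at 2 */
	if (before != 2)
		goto out;
	if (pread(fd, buf, sizeof(buf), 5) != (ssize_t)sizeof(buf))
		goto out;
	after = lseek(fd, 0, SEEK_CUR);		/* query without moving */

	result = (after == before && memcmp(buf, "5678", 4) == 0);
out:
	close(fd);
	return result;
}
```

With the hunk quoted above, the same kind of check against a seq_file would be expected to see the offset move, since the patch seeks and then stores the new position with file_pos_write().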

>  		fput_light(file, fput_needed);
>  	}
>
> diff --git a/fs/seq_file.c b/fs/seq_file.c
> index 3f54dbd..f8c5379 100644
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
> @@ -50,6 +50,8 @@ int seq_open(struct file *file, const struct seq_operations *op)
>
>  	/* SEQ files support lseek, but not pread/pwrite */
>  	file->f_mode &= ~(FMODE_PREAD | FMODE_PWRITE);
> +	file->f_mode |= FMODE_SEQ_FILE;
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(seq_open);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5f7b912..c3b5916 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -76,6 +76,8 @@ extern int dir_notify_enable;
> behavior for cross-node execution/opening_for_writing of files */
> #define FMODE_EXEC 16
>
> +#define FMODE_SEQ_FILE_PREAD 32

-EWONTCOMPILE, btw.