Re: [patch] pipe: add support for shrinking and growing pipes

From: Jens Axboe
Date: Thu Jun 03 2010 - 07:13:04 EST


On 2010-06-03 13:00, Michael Kerrisk wrote:
> On Thu, Jun 3, 2010 at 12:56 PM, Michael Kerrisk
> <mtk.manpages@xxxxxxxxxxxxxx> wrote:
>> On Thu, Jun 3, 2010 at 12:38 PM, Jens Axboe <jaxboe@xxxxxxxxxxxx> wrote:
>>> On 2010-06-03 09:48, Michael Kerrisk wrote:
>>>> On Thu, Jun 3, 2010 at 9:05 AM, Michael Kerrisk
>>>> <mtk.manpages@xxxxxxxxxxxxxx> wrote:
>>>>> On Thu, Jun 3, 2010 at 9:01 AM, Jens Axboe <jaxboe@xxxxxxxxxxxx> wrote:
>>>>>> On Thu, Jun 03 2010, Michael Kerrisk wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> On Thu, Jun 3, 2010 at 8:10 AM, Jens Axboe <jaxboe@xxxxxxxxxxxx> wrote:
>>>>>>>> On Wed, Jun 02 2010, Michael Kerrisk wrote:
>>>>>>>>> On Tue, Jun 1, 2010 at 9:45 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>>>>> On Thu, May 27 2010, Michael Kerrisk wrote:
>>>>>>>>>>> Jens,
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 24, 2010 at 7:56 PM, Jens Axboe <jens.axboe@xxxxxxxxxx> wrote:
>>>>>>>>>>>> On Mon, May 24 2010, Michael Kerrisk wrote:
>>>>>>>>>>>>> On Mon, May 24, 2010 at 7:35 PM, Jens Axboe <jens.axboe@xxxxxxxxxx> wrote:
>>>>>>>>>>>>>> On Mon, May 24 2010, Michael Kerrisk wrote:
>>>>>>>>>>>>>>>> Right, that looks like a thinko.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll submit a patch changing it to bytes and the agreed API and fix this
>>>>>>>>>>>>>>>> -Eerror. Thanks for your comments and suggestions!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks. And of course you are welcome. (Please CC linux-api@vger on
>>>>>>>>>>>>>>> this patche (and all patches that change the API/ABI.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The first change is this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=0191f8697bbdfefcd36e7b8dc3eeddfe82893e4b
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and the one dealing with the pages vs bytes API is this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=b9598db3401282bb27b4aef77e3eee12015f7f29
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not tested yet, will do so before sending in of course.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eyeballing it quickly, these changes look right.
>>>>>>>>>>>>
>>>>>>>>>>>> Good, thanks.
>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have some test programs you can make available?
>>>>>>>>>>>>
>>>>>>>>>>>> Actually I don't, I test it by modifying fio's splice engine to set/get
>>>>>>>>>>>> the pipe size and test the resulting transfers.
>>>>>>>>>>>
>>>>>>>>>>> An afterthought. Do there not also need to be fixes to the /proc
>>>>>>>>>>> interfaces. I don't think they were included in your revised patches.
>>>>>>>>>>
>>>>>>>>>> I think the proc part can be sanely left in pages, since it's just a
>>>>>>>>>> memory limiter.
>>>>>>>>>
>>>>>>>>> I can't see any advantage to using two different units for these
>>>>>>>>> closely related APIs, and it does seem like it could be a source of
>>>>>>>>> confusion. Similar APIs that I can think of like RLIMIT_MEMLOCK and
>>>>>>>>> shmget() SHMMAX that impose per-process memory-related limits use
>>>>>>>>> bytes. Best to be consistent, don't you think?
>>>>>>>>
>>>>>>>> But they are different interfaces. I think the 'pass in required size,
>>>>>>>> return actual size' where actual size is >= required size makes sense
>>>>>>>> for the syscall part, but for an "admin" interface it is more logical to
>>>>>>>> deal in pages. Perhaps that's just me and the average admin does not
>>>>>>>> agree. So while it's just detail, it's also an interface so has some
>>>>>>>> importance. And if there's consensus that bytes is a cleaner interface
>>>>>>>> on the proc side as well, then lets change it.
>>>>>>>
>>>>>>> I'll add one more datapoint to those that I already mentioned.
>>>>>>> RLIMIT_STACK and RLIMIT_DATA (getrlimit()) is also expressed in bytes.
>>>>>>>
>>>>>>> There was only one vaguely related limit that I could find that
>>>>>>> measured things in pages. Consider these two System V shared memory
>>>>>>> limits:
>>>>>>>
>>>>>>> SHMMAX
>>>>>>> This is the maximum size (in bytes) of a shared memory segment.
>>>>>>>
>>>>>>> SHMALL
>>>>>>> This is a system-wide limit on the total number of pages of shared memory.
>>>>>>>
>>>>>>> But in a way this almost confirms my point. SHMMAX is a limit the
>>>>>>> governs the behavior of individual processes (like your /proc file),
>>>>>>> while SHMALL is a limit that governs the behavior of the system as a
>>>>>>> whole. There is a (sort of) logic to using bytes for one and pages for
>>>>>>> the other.
>>>>>>>
>>>>>>> I think that I've said all I need to say on the topic. I'm inclined to
>>>>>>> think yours /proc file should use bytes, since it seems consistent
>>>>>>> with other simialr APIs. Others may confirm, or someone else mught
>>>>>>> have a different insight.
>>>>>>
>>>>>> I'll commit a patch to change it to bytes.
>>>>>
>>>>> Thanks Jens.
>>>>
>>>> Since I'm going to document the /proc file, it occurred to me... What
>>>> are you going to call this file now? "pipe_max_pages" no longer makes
>>>> sense. "pipe_size_ceiling" may be more expressive than simply
>>>> "pipe_max".
>>>
>>> pipe_max_size?
>>
>> Okay too.
>
> Better use dashes in the name, of course.

I looked at that when I did the original pipe-max-pages, but it seems
we're not consistent in this regard. Anyway, here's what I came up with.
Not tested yet, and I hope thunderbird doesn't fsck up the patch.

diff --git a/fs/pipe.c b/fs/pipe.c
index f98fae3..f7f62f3 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,9 +26,14 @@

/*
* The max size that a non-root user is allowed to grow the pipe. Can
- * be set by root in /proc/sys/fs/pipe-max-pages
+ * be set by root in /proc/sys/fs/pipe_max_size
*/
-unsigned int pipe_max_pages = PIPE_DEF_BUFFERS * 16;
+unsigned int pipe_max_size = 1048576;
+
+/*
+ * Minimum pipe size, as required by POSIX
+ */
+unsigned int pipe_min_size = PAGE_SIZE;

/*
* We use a start+len construction, which provides full use of the
@@ -1156,6 +1161,35 @@ static long pipe_set_size(struct pipe_inode_info
*pipe, unsigned long nr_pages)
return nr_pages * PAGE_SIZE;
}

+/*
+ * Currently we rely on the pipe array holding a power-of-2 number
+ * of pages.
+ */
+static inline unsigned int round_pipe_size(unsigned int size)
+{
+ unsigned long nr_pages;
+
+ nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ return roundup_pow_of_two(nr_pages) << PAGE_SHIFT;
+}
+
+/*
+ * This should work even if CONFIG_PROC_FS isn't set, as
proc_dointvec_minmax
+ * will return an error.
+ */
+int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
+ size_t *lenp, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);
+ if (ret < 0 || !write)
+ return ret;
+
+ pipe_max_size = round_pipe_size(pipe_max_size);
+ return ret;
+}
+
long pipe_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct pipe_inode_info *pipe;
@@ -1169,23 +1203,19 @@ long pipe_fcntl(struct file *file, unsigned int
cmd, unsigned long arg)

switch (cmd) {
case F_SETPIPE_SZ: {
- unsigned long nr_pages;
+ unsigned int size, nr_pages;

- /*
- * Currently the array must be a power-of-2 size, so adjust
- * upwards if needed.
- */
- nr_pages = (arg + PAGE_SIZE - 1) >> PAGE_SHIFT;
- nr_pages = roundup_pow_of_two(nr_pages);
+ size = round_pipe_size(arg);
+ nr_pages = size >> PAGE_SHIFT;

- if (!capable(CAP_SYS_RESOURCE) && nr_pages > pipe_max_pages) {
+ if (!capable(CAP_SYS_RESOURCE) && size > pipe_max_size) {
ret = -EPERM;
goto out;
- } else if (nr_pages < 1) {
+ } else if (nr_pages < PAGE_SIZE) {
ret = -EINVAL;
goto out;
}
- ret = pipe_set_size(pipe, arg);
+ ret = pipe_set_size(pipe, nr_pages);
break;
}
case F_GETPIPE_SZ:
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 16de393..4457969 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -139,7 +139,9 @@ void pipe_lock(struct pipe_inode_info *);
void pipe_unlock(struct pipe_inode_info *);
void pipe_double_lock(struct pipe_inode_info *, struct pipe_inode_info *);

-extern unsigned int pipe_max_pages;
+extern unsigned int pipe_max_size, pipe_min_size;
+int pipe_proc_fn(struct ctl_table *, int, void __user *, size_t *,
loff_t *);
+

/* Drop the inode semaphore and wait for a pipe event, atomically */
void pipe_wait(struct pipe_inode_info *pipe);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 997080f..2f9b3a6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1471,12 +1471,12 @@ static struct ctl_table fs_table[] = {
},
#endif
{
- .procname = "pipe-max-pages",
- .data = &pipe_max_pages,
+ .procname = "pipe_max_size",
+ .data = &pipe_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
- .extra1 = &two,
+ .proc_handler = &pipe_proc_fn,
+ .extra1 = &pipe_min_size,
},
/*
* NOTE: do not add new entries to this table unless you have read

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/