Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring

From: Andrew Morton
Date: Mon Mar 09 2009 - 18:37:44 EST


On Mon, 09 Mar 2009 11:49:57 -0400
Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:

> Hi,
>
> Believe it or not, I get numerous questions from customers about the
> suggested tuning value of aio-max-nr. aio-max-nr limits the total
> number of io events that can be reserved, system wide, for aio
> completions. Each time io_setup is called, a ring buffer is allocated
> that can hold nr_events I/O completions. That ring buffer is then
> mapped into the process' address space, and the pages are pinned in
> memory. So, the reason for this upper limit (I believe) is to keep a
> malicious user from pinning all of kernel memory. Now, this sounds like
> a much better job for the memlock rlimit to me, hence the following
> patch.
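
For reference, the interface whose ring pages get pinned is just this
pair of calls (a minimal sketch using the libaio wrappers; the raw
syscalls behave the same way):

        #include <libaio.h>

        /* reserve, then release, a ring for 128 completions */
        static int probe_ring(void)
        {
                io_context_t ctx = 0;
                int ret;

                ret = io_setup(128, &ctx);  /* pins the ring pages */
                if (ret < 0)
                        return ret;  /* e.g. -EAGAIN once over the limit */
                return io_destroy(ctx);     /* unpins and frees the ring */
        }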
>
> It's worth noting that, by default, aio-max-nr was set to 65536
> requests, while the memlock rlimit defaults to 64KB per process.
> With this patch in place, a single process can specify 2045 for the
> nr_events parameter of io_setup. Or, in the worst case, a process
> can only create 16 io contexts, each with a single event (since the
> minimum ring size is a page).
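
The 2045 figure checks out if you assume 4KB pages, 32-byte io_events,
a 32-byte struct aio_ring header, and the two slots of slack that
aio_setup_ring() adds ("1 is required, 2 for good luck"):

        64KB memlock limit         = 16 pages = 65536 bytes
        2045 requested + 2 slack   = 2047 ring slots
        2047 * 32 bytes + 32 bytes = 65536 bytes = exactly 16 pinned pages

And since the smallest possible ring is one page, a 16-page budget
allows at most 16 single-event contexts, as stated.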
>
> I created a simple program that sets the process' memlock rlimit and
> then drops the CAP_IPC_LOCK capability (the memlock rlimit is a command
> line parameter, so you can see how the nr_events allowable changes with
> different limits in place). Then, it calls io_setup/io_destroy in a
> loop, increasing nr_events until it fails. Finally, it calls io_setup
> in a loop with a single event to see how many iterations it can perform
> before receiving -EAGAIN. I ran the test on a patched kernel and the
> results are in line with expectations.
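
A sketch of the same idea (the real program is linked below; this one
assumes libaio, takes the limit from the command line, and omits the
CAP_IPC_LOCK drop that the real test performs):

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/resource.h>
        #include <libaio.h>          /* link with -laio */

        int main(int argc, char **argv)
        {
                struct rlimit rl;
                io_context_t ctx;
                int nr;

                if (argc < 2)
                        return 1;
                /* memlock rlimit from the command line, in bytes */
                rl.rlim_cur = rl.rlim_max = atol(argv[1]);
                if (setrlimit(RLIMIT_MEMLOCK, &rl))
                        perror("setrlimit");

                /* grow nr_events until io_setup() refuses */
                for (nr = 1; ; nr++) {
                        ctx = 0;
                        if (io_setup(nr, &ctx) < 0)
                                break;
                        io_destroy(ctx);
                }
                printf("largest single ring: %d events\n", nr - 1);
                return 0;
        }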
>
> I also ran the aio-dio-regress regression tests; all passed but one.
> The failing test expected a return code of -ENOMEM but instead got
> -EAGAIN. Part of the patch below returns a proper error code from
> aio_setup_ring, so I'm calling this a test issue, but we can debate
> this nuance if it is deemed important.
>
> Further, as a result of this exercise, I noticed that there are numerous
> places in the kernel that test to see if locking memory will exceed the
> maximum amount of allowed locked memory. I've factored out that bit of
> code in a follow-on patch that I will post.
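
The pattern being factored out is presumably the check that mm/mlock.c
and friends open-code today, something like this (helper name
hypothetical, written against 2.6.29-era internals):

        /* may the current task pin nr_pages more pages? */
        static int check_memlock_limit(unsigned long nr_pages)
        {
                unsigned long locked, limit;

                locked = current->mm->locked_vm + nr_pages;
                limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
                                >> PAGE_SHIFT;
                if (locked > limit && !capable(CAP_IPC_LOCK))
                        return -ENOMEM;
                return 0;
        }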
>
> Finally, I updated the aio man pages to reflect this change (and fix
> references to aio_context_t that should be io_context_t).
>
> You can find a copy of the test program I used at:
> http://people.redhat.com/jmoyer/aio/ioctx_alloc.c
> I'll apologize in advance for the crudity of the test code.
>
> For those reviewing the below patch, it may be worth looking at:
>
> commit d55b5fdaf40846221d543937b786956e27837fda
> Author: Zach Brown <zach.brown@xxxxxxxxxx>
> Date: Mon Nov 7 00:59:31 2005 -0800
>
> [PATCH] aio: remove aio_max_nr accounting race
>
> The below patch basically reverts the above commit. There is no
> accounting race now, because we are charging per-process rlimits instead
> of a system-wide maximum number of aio requests. This has the added
> benefit of reducing some code complexity.
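
For context, the race that commit closed was an unlocked check-then-add
on the global counter, roughly:

        if (aio_nr + nr_events > aio_max_nr)    /* two racing io_setup() */
                goto out_cleanup;               /* calls can both pass   */
        aio_nr += nr_events;                    /* ... and both charge   */

aio_nr_lock serialized that pair. With the limit moved to a per-process
rlimit there is no shared maximum left to race on, so a plain atomic
counter suffices for the remaining aio-nr reporting.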

It's risky to simply remove an existing tunable. What if someone's
mission-critical startup script, provided by a super-slow or even
no-longer-in-business vendor, does

        if (write(aio_max_nr_fd, something, len) < 0)
                crash_and_burn();

?

It would be prudent to have a more cautious update scheme. Leave the
existing tunable in place. Keep it working if possible. If someone
uses it, blurt out a loud printk telling them that they're using a
deprecated interface and informing them how to update.

Then at some later time we can remove the old interface.
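
Something along these lines in the old sysctl's write path would do
(sketch only; the proc handler signature varies by kernel version):

        if (write) {
                static int warned;

                if (!warned) {
                        warned = 1;
                        printk(KERN_WARNING "%s (%d): fs.aio-max-nr is "
                                "deprecated and no longer enforced; aio "
                                "rings are now charged against "
                                "RLIMIT_MEMLOCK\n",
                                current->comm, task_pid_nr(current));
                }
        }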

> /*------ sysctl variables----*/
> -static DEFINE_SPINLOCK(aio_nr_lock);
> -unsigned long aio_nr; /* current system wide number of aio requests */
> -unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
> +/* current system wide number of aio requests */
> +atomic_t aio_nr = ATOMIC_INIT(0);

Was it a conscious decision to downgrade this from a `long' type to an
`int' type? It _could_ have used atomic_long_t.
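
The wider variant would be a one-line change at each site (sketch; the
add/sub presumably land in ioctx_alloc() and __put_ioctx()):

        static atomic_long_t aio_nr = ATOMIC_LONG_INIT(0);

        atomic_long_add(nr_events, &aio_nr);    /* ioctx_alloc() */
        atomic_long_sub(nr_events, &aio_nr);    /* __put_ioctx() */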

