Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory

From: Andrew Shewmaker
Date: Thu Feb 28 2013 - 20:11:50 EST


On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:
> On Wed, 27 Feb 2013 15:56:30 -0500
> Andrew Shewmaker <agshew@xxxxxxxxx> wrote:
>
> > The following patches are against the mmtom git tree as of February 27th.
> >
> > The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
> > the 3% reserve for other user processes.
> >
> > The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
> > modes, replacing the hardcoded 3% reserve for the root user with a
> > tunable knob.
> >
>
> Gee, it's been years since anyone thought about the overcommit code.
>
> Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
> "Appropriate for some scientific applications", but doesn't say why.
> You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
> I think? Is the documentation wrong?

None of my scientists appeared to use sparse arrays as Alan described.
My users would run jobs that appeared to initialize correctly. However,
they wouldn't write to every page they malloced (and they wouldn't use
calloc), so I saw jobs failing well into a computation once the
simulation tried to access a page and the kernel couldn't give it to them.

I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
infeasible memory requirements fail early and the OOM killer
gets triggered much less often than in guess mode. More often than not
the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
Disabling overcommit worked so well during the stabilization and
early user phases that we did the same with other clusters.

> > __vm_enough_memory reserves 3% of free pages with the default
> > overcommit mode and 6% when overcommit is disabled. These hardcoded
> > values have become less reasonable as memory sizes have grown.
> >
> > On scientific clusters, systems are generally dedicated to one user.
> > Also, overcommit is sometimes disabled in order to prevent a long
> > running job from suddenly failing days or weeks into a calculation.
> > In this case, a user wishing to allocate as much memory as possible
> > to one process may be prevented from using, for example, around 7GB
> > out of 128GB.
> >
> > The effect is less, but still significant when a user starts a job
> > with one process per core. I have repeatedly seen a set of processes
> > requesting the same amount of memory fail because one of them could
> > not allocate the amount of memory a user would expect to be able to
> > allocate.
> >
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
> > allowed -= allowed / 32;
> > allowed += total_swap_pages;
> >
> > - /* Don't let a single process grow too big:
> > - leave 3% of the size of this process for other processes */
> > - if (mm)
> > - allowed -= mm->total_vm / 32;
> > -
> > if (percpu_counter_read_positive(&vm_committed_as) < allowed)
> > return 0;
>
> So what might be the downside for this change? root can't log in, I
> assume. Have you actually tested for this scenario and observed the
> effects?
>
> If there *are* observable risks and/or to preserve back-compatibility,
> I guess we could create a fourth overcommit mode which provides the
> headroom which you desire.
>
> Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
> as well?

The downside of the first patch, which removes the "other" reserve
(sorry about the confusing duplicated subject line), is that a user
may not be able to kill their process, even if they have a shell prompt.
When testing, I did sometimes get into spot where I attempted to execute
kill, but got: "bash: fork: Cannot allocate memory". Of course, a
user can get in the same predicament with the current 3% reserve--they
just have to start processes until 3% becomes negligible.

With just the first patch, root still has a 3% reserve, so they can
still log in.

When I resubmit the second patch, adding a tunable rootuser_reserve_pages
variable, I'll test both guess and never overcommit modes to see what
minimum initial values allow root to login and kill a user's memory
hogging process. This will be safer than the current behavior since
root's reserve will never shrink to something useless in the case where
a user has grabbed all available memory with many processes.

As an estimate of a useful rootuser_reserve_pages, the rss+share size of
sshd, bash, and top is about 16MB. Overcommit disabled mode would need
closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
the new tunable would still be a win.

I think the tunable would benefit everyone over the current behavior,
but would you prefer it if I only made it tunable in a fourth overcommit
mode in order to preserve back-compatibility?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/