Re: [PATCH 1/2] mm, oom: introduce oom reaper

From: David Rientjes
Date: Tue Feb 02 2016 - 17:52:00 EST


On Tue, 2 Feb 2016, Michal Hocko wrote:

> > Not exclude them, but I would have expected untrack_pfn().
>
> My understanding is that vm_normal_page will do the right thing for
> those mappings - especially for CoW VM_PFNMAP, which are normal pages
> AFAIU. Wrt. untrack_pfn, I was relying on the victim eventually
> entering exit_mmap and doing the remaining housekeeping. Maybe I am
> missing something, but untrack_pfn shouldn't lead to releasing a
> considerable amount of memory. So is this really necessary, or can we
> wait for exit_mmap?
>

I think if you move the code to mm/memory.c, you may find a greater
opportunity to share code with the implementations there, and this will
take care of itself :) I'm also concerned about this from a
maintainability standpoint, where a future patch might modify one
implementation while forgetting about the other. I think there's a great
opportunity here for a really clean and shiny interface that doesn't
introduce any more complexity.
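
To illustrate (untested sketch, the function name is made up): a helper
in mm/memory.c along these lines would let the reaper reuse the unmap
machinery that exit_mmap() eventually goes through, with vm_normal_page()
handling the CoW VM_PFNMAP mappings inside unmap_page_range():

/*
 * Sketch only: reap the anonymous memory of an mm by reusing the
 * unmap machinery already living in mm/memory.c.
 */
void reap_mm(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	tlb_gather_mmu(&tlb, mm, 0, -1);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* mlocked and hugetlb memory need more care, skip them */
		if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
			continue;
		/* only private memory is certainly freeable */
		if (vma->vm_flags & VM_SHARED)
			continue;
		unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
	}
	tlb_finish_mmu(&tlb, 0, -1);
}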

> > The problem is that this is racy and quite easy to trigger: imagine if
> > __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the
> > victim has been freed, and then another system-wide oom condition occurs
> > before the oom reaper's mm_to_reap has been set to NULL.
>
> Yes, I realize this is potentially racy. I just didn't consider the race
> important enough to justify task queuing in the first submission. Tetsuo
> was pushing for this already, and I tried to push back for simplicity.
> But oh well... I will queue up a patch to do this on top. I plan to
> repost the full patchset shortly.
>

Ok, thanks! It should probably be dropped from -mm in the interim until
it has some acked-by's, but I think those will come pretty quickly once
it's refreshed, provided all of this is handled.
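
For the queueing itself, I'd imagine something as simple as the following
would do (untested sketch: the function name and the list_head in
mm_struct are made up, oom_reaper_wait is the wait queue from your patch):

static LIST_HEAD(oom_reap_list);
static DEFINE_SPINLOCK(oom_reap_lock);

static void queue_mm_to_reap(struct mm_struct *mm)
{
	atomic_inc(&mm->mm_count);	/* pinned until the reaper is done */
	spin_lock(&oom_reap_lock);
	/* oom_reap_node would be a new list_head added to mm_struct */
	list_add_tail(&mm->oom_reap_node, &oom_reap_list);
	spin_unlock(&oom_reap_lock);
	wake_up(&oom_reaper_wait);
}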

> > In this case, the oom reaper has ignored the next victim and doesn't do
> > anything; the simple race prevents it from zapping memory and does
> > nothing to reduce the livelock probability.
> >
> > This can be solved either by queueing mm's to reap or by involving the
> > oom reaper in the oom killer synchronization itself.
>
> As we have already discussed, the oom reaper is really tricky to call
> from the direct OOM context. I will go with queuing.
>

Hmm, I wasn't referring to the oom context: this would be possible
without queueing by protecting mm_to_reap with a lock (or cmpxchg) in
both the oom reaper and the final mmput() path. Set it when the mm is
ready for reaping, clear it when the mm is being destroyed, and test it
before calling the oom killer. I think we'd want to defer the oom killer
until potential reaping could be done anyway, and I don't anticipate an
issue where the oom_reaper fails to schedule.
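
Roughly what I have in mind (untested sketch; mm_to_reap and
oom_reaper_wait are the ones from your patch, the helper names are made
up):

/* oom kill path: publish the victim's mm and wake the reaper */
static void set_mm_to_reap(struct mm_struct *mm)
{
	atomic_inc(&mm->mm_count);	/* pin the mm_struct itself */
	if (cmpxchg(&mm_to_reap, NULL, mm) == NULL)
		wake_up(&oom_reaper_wait);
	else
		mmdrop(mm);	/* a previous reap is still pending */
}

/*
 * Reaper: claim the pending mm, inheriting the pin above.  It must
 * still atomic_inc_not_zero(&mm->mm_users) before touching page
 * tables, and mmdrop() the mm when it is done.
 */
static struct mm_struct *oom_reaper_claim(void)
{
	return xchg(&mm_to_reap, NULL);
}

/* final mmput(): the victim is exiting, so drop a stale slot */
static void clear_mm_to_reap(struct mm_struct *mm)
{
	if (cmpxchg(&mm_to_reap, mm, NULL) == mm)
		mmdrop(mm);
}

/* out_of_memory(): defer a new kill while a reap is still pending */
static bool reap_pending(void)
{
	return READ_ONCE(mm_to_reap) != NULL;
}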

> > I'm baffled by any reference to "memcg oom heavy loads"; I don't
> > understand this paragraph, sorry. If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.
> > The disruption itself is not only in the first wakeup but also in how
> > long the reaper can run and when it is rescheduled: for a lot of memory
> > this is potentially long. The reaper is best-effort, as the changelog
> > indicates, and we shouldn't rely on this high priority: oom kill exiting
> > can't possibly be expected to be immediate. This high priority should be
> > removed so memcg oom conditions are isolated and don't affect other
> > loads.
>
> If this is a concern, then I would be tempted to simply disable the oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though best-effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.
>

I don't believe the higher priority guarantees it is able to schedule any
more than it was guaranteed to schedule before. Without the elevated
priority it will still run; it just won't preempt other innocent
processes in disjoint memcgs or cpusets. And it's not only a memcg issue:
it also impacts disjoint cpuset mems and mempolicy nodemasks, so it would
be disappointing to address memcg alone and leave those out. In the
interest of fairness, the higher priority should simply be removed, as
sketched below.
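
Concretely, nothing more than an ordinary kthread should be needed
(sketch only, assuming the oom_reaper() thread function from your patch;
note there is no sched_setscheduler() call at all):

static struct task_struct *oom_reaper_th;

static int __init oom_reaper_init(void)
{
	/* default scheduling policy and priority */
	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
	if (IS_ERR(oom_reaper_th)) {
		pr_err("Unable to start OOM reaper: %ld\n",
		       PTR_ERR(oom_reaper_th));
		oom_reaper_th = NULL;
	}
	return 0;
}
subsys_initcall(oom_reaper_init);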

Other than these issues, I don't see any reason why a refreshed series
wouldn't be immediately acked. Thanks very much for continuing to work on
this!