Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

From: Paul E. McKenney
Date: Tue Jun 26 2018 - 13:02:12 EST


On Thu, Jun 21, 2018 at 08:27:41PM +0900, Tetsuo Handa wrote:
> On 2018/06/21 7:36, David Rientjes wrote:
> >> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
> >> EXPORT_SYMBOL_GPL(unregister_oom_notifier);
> >>
> >> /**
> >> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
> >> + *
> >> + * Returns non-zero if notifier callbacks released something, zero otherwise.
> >> + */
> >> +unsigned long try_oom_notifier(void)
> >
> > It certainly is tried, but based on its usage it would probably be better
> > to describe what is being returned (it's going to set *did_some_progress,
> > which is a page count).
>
> Well, it depends on what the callbacks are doing. Currently, we have 5 users.
>
> arch/powerpc/platforms/pseries/cmm.c
> arch/s390/mm/cmm.c
> drivers/gpu/drm/i915/i915_gem_shrinker.c
> drivers/virtio/virtio_balloon.c
> kernel/rcu/tree_plugin.h
>
> Speaking of rcu_oom_notify() in kernel/rcu/tree_plugin.h, we can't tell whether
> the callback helped release memory, for it does not update the "freed" argument.

It does not immediately release memory, but instead takes steps to
encourage memory to be released more quickly.

> >> +{
> >> +	static DEFINE_MUTEX(oom_notifier_lock);
> >> +	unsigned long freed = 0;
> >> +
> >> +	/*
> >> +	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
> >> +	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
> >> +	 * If lockdep reports possible deadlock dependency, it will be a bug in
> >> +	 * OOM notifier callbacks.
> >> +	 *
> >> +	 * If SIGKILL is pending, it is likely that current thread was selected
> >> +	 * as an OOM victim. In that case, current thread should return as soon
> >> +	 * as possible using memory reserves.
> >> +	 */
> >> +	if (mutex_lock_killable(&oom_notifier_lock))
> >> +		return 0;
> >> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >> +	mutex_unlock(&oom_notifier_lock);
> >> +	return freed;
> >> +}
> >
> > If __blocking_notifier_call_chain() used down_read_killable(), could we
> > eliminate oom_notifier_lock?
>
> I don't think we can eliminate it now, for it provides the serialization
> (while still responding to SIGKILL as soon as possible) which is currently
> achieved by mutex_trylock(&oom_lock).
>
> (1) rcu_oom_notify() in kernel/rcu/tree_plugin.h is not prepared for being
> called concurrently.
>
> ----------
> static int rcu_oom_notify(struct notifier_block *self,
> 			  unsigned long notused, void *nfreed)
> {
> 	int cpu;
> 
> 	/* Wait for callbacks from earlier instance to complete. */
> 	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0); // <= Multiple threads can pass this line at the same time.
> 	smp_mb(); /* Ensure callback reuse happens after callback invocation. */
> 
> 	/*
> 	 * Prevent premature wakeup: ensure that all increments happen
> 	 * before there is a chance of the counter reaching zero.
> 	 */
> 	atomic_set(&oom_callback_count, 1); // <= Multiple threads can execute this line at the same time.
> 
> 	for_each_online_cpu(cpu) {
> 		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
> 		cond_resched_tasks_rcu_qs();
> 	}
> 
> 	/* Unconditionally decrement: no need to wake ourselves up. */
> 	atomic_dec(&oom_callback_count); // <= Multiple threads can execute this line at the same time, making oom_callback_count < 0 ?
> 
> 	return NOTIFY_OK;
> }
> ----------
>
> The counter inconsistency problem could be fixed by
>
> - atomic_set(&oom_callback_count, 1);
> + atomic_inc(&oom_callback_count);
>
> but who benefits if rcu_oom_notify() becomes safe to call concurrently?
> We want to wait for the callbacks to complete before proceeding to the
> OOM killer. I think that we should save CPU resources by serializing
> concurrent callers.

There are a lot of ways it could be made concurrency-safe. If you need
me to do this, please do let me know.

That said, the way it is now written, if you invoke rcu_oom_notify()
twice in a row, the second invocation will wait until the memory from
the first invocation is freed. What do you want me to do if you invoke
me concurrently?

1.	One invocation "wins", waits for the earlier callbacks to
	complete, then encourages any subsequent callbacks to be
	processed more quickly. The other invocations return
	immediately without doing anything (a rough sketch of this
	follows below).

2.	The invocations serialize, with each invocation waiting for
	the callbacks from the previous invocation (in mutex_lock()
	order or some such), and then starting a new round.

3.	Something else?
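
For concreteness, here is a rough, untested sketch of option 1, reusing the
existing oom_callback_wq/oom_callback_count machinery; the rcu_oom_mutex
below is hypothetical and not part of any posted patch:

----------
static DEFINE_MUTEX(rcu_oom_mutex);	/* hypothetical serialization lock */

static int rcu_oom_notify(struct notifier_block *self,
			  unsigned long notused, void *nfreed)
{
	int cpu;

	/* The first caller wins; concurrent callers return immediately. */
	if (!mutex_trylock(&rcu_oom_mutex))
		return NOTIFY_OK;

	/* Wait for callbacks from the earlier instance to complete. */
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
	smp_mb(); /* Ensure callback reuse happens after callback invocation. */

	/* Prevent premature wakeup, as in the current code. */
	atomic_set(&oom_callback_count, 1);

	for_each_online_cpu(cpu) {
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
		cond_resched_tasks_rcu_qs();
	}

	/* Unconditionally decrement: no need to wake ourselves up. */
	atomic_dec(&oom_callback_count);

	mutex_unlock(&rcu_oom_mutex);
	return NOTIFY_OK;
}
----------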

Thanx, Paul

> (2) i915_gem_shrinker_oom() in drivers/gpu/drm/i915/i915_gem_shrinker.c depends
> on the mutex_trylock() in shrinker_lock() (reached from i915_gem_shrink_all()
> via i915_gem_shrink()) returning 1 (i.e. succeeding) before need_resched()
> becomes true, in order to avoid returning without reclaiming memory.
>
> > This patch is certainly an improvement because it does the last
> > get_page_from_freelist() call after invoking the oom notifiers that can
> > free memory and we've otherwise pointlessly redirected it elsewhere.
>
> Thanks, but this patch might break the subtle balance which is currently
> achieved by mutex_trylock(&oom_lock) serialization/exclusion.
>
> (3) virtballoon_oom_notify() in drivers/virtio/virtio_balloon.c by default
> tries to release 256 pages. Since this value is configurable, one might set
> it to 1048576 pages. If virtballoon_oom_notify() is called concurrently by
> many threads, it might needlessly deflate the memory balloon.
>
> We might want to remember and reuse the last result among serialized callers
> (a feedback mechanism), like
>
> {
> 	static DEFINE_MUTEX(oom_notifier_lock);
> 	static unsigned long last_freed;
> 	unsigned long freed = 0;
> 
> 	if (mutex_trylock(&oom_notifier_lock)) {
> 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 		last_freed = freed;
> 	} else {
> 		mutex_lock(&oom_notifier_lock);
> 		freed = last_freed;
> 	}
> 	mutex_unlock(&oom_notifier_lock);
> 	return freed;
> }
>
> or
>
> {
> 	static DEFINE_MUTEX(oom_notifier_lock);
> 	static unsigned long last_freed;
> 	unsigned long freed = 0;
> 
> 	if (mutex_lock_killable(&oom_notifier_lock)) {
> 		freed = last_freed;
> 		last_freed >>= 1;
> 		return freed;
> 	} else if (last_freed) {
> 		freed = last_freed;
> 		last_freed >>= 1;
> 	} else {
> 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 		last_freed = freed;
> 	}
> 	mutex_unlock(&oom_notifier_lock);
> 	return freed;
> }
>
> . Without a feedback mechanism, mutex_lock_killable(&oom_notifier_lock) serialization
> could still needlessly deflate the memory balloon compared to mutex_trylock(&oom_lock)
> serialization/exclusion. Maybe virtballoon_oom_notify() (and the two CMM users) would
> implement a feedback mechanism themselves, by examining the watermarks from their OOM
> notifier hooks, as sketched below.
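>
> A rough, hypothetical sketch of that idea (example_oom_notify() and
> do_driver_specific_release() are placeholders, and the watermark check
> shown is only an illustration, not a recommended threshold):
>
> static int example_oom_notify(struct notifier_block *self,
> 			      unsigned long dummy, void *parm)
> {
> 	unsigned long *freed = parm;
>
> 	/*
> 	 * Hypothetical feedback check: if free pages already exceed (an
> 	 * arbitrary multiple of) the reserves, an earlier serialized caller
> 	 * presumably released enough memory, so skip the expensive work.
> 	 */
> 	if (global_zone_page_state(NR_FREE_PAGES) > 2 * totalreserve_pages)
> 		return NOTIFY_OK;
>
> 	/* Placeholder for the driver-specific release, e.g. balloon deflate. */
> 	*freed += do_driver_specific_release();
> 	return NOTIFY_OK;
> }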
>
>
>
> On 2018/06/21 16:31, Michal Hocko wrote:
> > On Wed 20-06-18 15:36:45, David Rientjes wrote:
> > [...]
> >> That makes me think that "oom_notify_list" isn't very intuitive: it can
> >> free memory as a last step prior to oom kill. OOM notify, to me, sounds
> >> like it's only notifying some callbacks about the condition. Maybe
> >> oom_reclaim_list and then rename this to oom_reclaim_pages()?
> >
> > Yes, agreed, and that is the reason I keep saying we want to get rid of
> > this yet-another-reclaim mechanism. We already have shrinkers, which are
> > the main source of non-LRU page reclaim. Why do we even need
> > oom_reclaim_pages? What is fundamentally different here? Sure, those
> > pages should be reclaimed as a last resort, but we already do have
> > priority for slab shrinking, so we know that the system is struggling
> > when reaching the lowest priority. Isn't that enough to express the need
> > for the current oom notifier implementations?
> >
>
> Even if we update OOM notifier users to use shrinker hooks, they will need a
> subtle balance which is currently achieved by mutex_trylock(&oom_lock).
>
> Removing the OOM notifier is not doable right now, and it is not suitable as a
> fix for the regression introduced by commit 27ae357fa82be5ab ("mm, oom: fix
> concurrent munlock and oom reaper unmap, v3"). What we could afford for this
> regression is https://patchwork.kernel.org/patch/9842889/ , which is exactly
> what you suggested in the thread at
> https://www.spinics.net/lists/linux-mm/msg117896.html .
>