Re: v3.4-rc2 out-of-memory problems (was Re: 3.4-rc1sticks-and-crashs)

From: David Rientjes
Date: Mon Apr 09 2012 - 19:25:53 EST


On Mon, 9 Apr 2012, Linus Torvalds wrote:

> > Right, we can't handoff the freeing of the task_struct to more than one
> > notifier. ÂIt seems misdesigned from the beginning and what we really want
> > is to hijack task->usage for __put_task_struct(task) if we have such a
> > notifier callchain and require each one (currently just oprofile) to take
> > a reference on task->usage for NOTIFY_OK and then be responsible for
> > dropping the reference when it's done with it later instead of requiring
> > it to free the task_struct itself.
>
> We could make notifier.c just "or" all the return values together, and
> then it's ok if *one* person returns NOTIFY_OK.
>

You could that if you also turned the check for "ret == NOTIFY_OK" in
profile_handoff_task() into "ret & NOTIFY_OK" in your patch, otherwise you
get a double free from __put_task_struct() and oprofile.

> Of course, that's not how notifiers are documented to work, but quite
> frankly, notifiers with non-zero values that don't sat STOP are broken
> as-is anyway, so you might we well do a logical "or" of the return
> values and at least make things like this work.
>

It works fine if the callbacks are correctly implemented, it's just that
the task handoff in kernel/profile.c is broken because it assumes only one
callback will return NOTIFY_OK, meaning it will eventually free, and its
only checking the return value of the last notifier called to see if
__put_task_struct() should immediately free.

In defense of notifiers, though, it works fine right now for memory
hotplug. The last issue I had with it was when slab lacked a callback
when a node was onlined or offlined in 2.6.34 and then I added memory
hotplug support for that allocator and it has since worked fine. For
things like MEM_GOING_OFFLINE, returning NOTIFY_BAD is great if the
subsystem of interest can't allow the memory to go offline (in-use slab
objects, for example). In the memory hotplug usecase, we certainly don't
want to stop at NOTIFY_OK because we need to notify every subsystem on the
callchain.

> Oh well. So my suggestion right now would be something like the
> attached. It's still horribly broken, it actively breaks documented
> notifier behavior, but dammit, if the notifier people don't like
> 'or'ing return values together they should damn well return zero from
> the notifier that doesn't do anything. And returning an error will
> exit out, so..
>

Instead of this and it's possible bad interactions with other notifiers
during the -rc cycle, I think it would be better to

(1) fix the lowmemorykiller so it doesn't need to use these notifiers at
all, which isn't difficult, for 3.4, then

(2a) change the task handoff to a refcount on task->usage after the final
put_task_struct() using the notifier and then allow it to be freed
after everybody does a put_handoff_task_struct() for 3.5

or

(2b) remove the task handoff notifier callchain entirely and just tie it
directly to oprofile since android won't be using it anymore after
(1).