Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper

From: Nico Pache
Date: Wed Apr 06 2022 - 15:33:01 EST

Next message: kernel test robot: "[ammarfaizi2-block:robh/linux/dt/pop-pci-nodes 2/2] drivers/pci/of.c:84: undefined reference to `.of_attach_node'"
Previous message: patchwork-bot+netdevbpf: "Re: [PATCH bpf-next] bpf/selftests: use bpf_num_possible_cpus() in per-cpu map allocations"
In reply to: Nico Pache: "Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/6/22 11:22, Nico Pache wrote:
>
>
> On 3/22/22 09:17, Thomas Gleixner wrote:
>> On Tue, Mar 22 2022 at 09:26, Michal Hocko wrote:
>>> On Mon 21-03-22 19:57:24, Davidlohr Bueso wrote:
>>>> On Mon, 21 Mar 2022, Nico Pache wrote:
>>>>
>>>>> We could proceed with the V3 approach; however if we are able to find a complete
>>>>> solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
>>>>> working, I dont see why we wouldnt go for it.
>>
>> See below.
>>
>>>> Because semantically killing the process is, imo, the wrong thing to do.
>>>
>>> I am not sure I follow. The task has been killed by the oom killer. All
>>> we are discussing here is how to preserve the robust list metadata
>>> stored in the memory which is normally unmapped by the oom_reaper to
>>> guarantee a further progress.
>>>
>>> I can see we have 4 potential solutions:
>>> 1) do not oom_reap oom victims with robust futex metadata in anonymous
>>> memory. Easy enough but it could lead to excessive oom killing in
>>> case the victim gets stuck in the kernel and cannot terminate.
>>> 2) clean up robust list from the oom_reaper context. Seems tricky due to
>>> #PF handling from the oom_reaper context which would need to be
>>> non-blocking
>>> 3) filter vmas which contain robust list. Simple check for the vma
>>> range
>>> 4) internally mark vmas which have to preserve the state during
>>> oom_reaping. Futex code would somehow have to mark those mappings.
>>> While more generic solution. I am not sure this is a practical
>>> approach.
>>
>> And all of that is based on wishful thinking, really. Let me explain.
>>
>> The task::robust_list pointer is set unconditionally by NPTL for every
>> thread of a process. It points to the 'list head' which is in the
>> TLS. But this does not tell whether the task holds a robust futex or
>> not. That's evaluated in the futex exit handling code.
>
> Ah, thanks for pointing that out. So yes, skipping the OOM if it contains a
> robust list is not really ideal as any process with pthreads will be skipped.
>
> Would it be logical to change this so that we are no longer making this syscall
> unconditionally? I still agree that skipping the OOM isnt very logical if we
> have a better solution available. Is it set unconditionally so that users dont
> have to do it dynamically when they enable the robustness?
> If this is the case it may be too much of a headache to implement.
>
>>
>> So solution #1 will prevent oom reaping completely simply because the
>> pointer is set on every user space task.
> Every userspace task that implements a pthread.
I stand corrected... Joel just showed me that it is indeed set on every
userspace task.

Not entirely sure why... but Im wrong here. >>
>> Solutions #2 and #3 are incomplete and just awful hacks which cure one
>> particular case: A single threaded process. Why?
>>
>> The chosen oom reaper victim is a process, so what does it help to check
>> or cleanup the robust list for _ONE_ thread? Nothing because the other
>> threads can hold robust futexes and then run into the same problem.
>>
>> Aside of that you seem to believe that the robust list head in the TLS
>> is the only part which is relevant. That's wrong. The list head is
>> either NULL or points to the innermost pthread_mutex which is held by a
>> task. Now look at this example:
>>
>> TLS:robust_list -> mutex2 -> mutex1
>>
>> mutex1 is the shared one which needs to be released so that other
>> processes can make progress. mutex2 is a process private one which
>> resides in a different VMA. So now if you filter the robust list and
>> refuse to reap the TLS VMA, what prevents the other VMA from being
>> reaped? If that's reaped then mutex1 is not reachable.
>
> This is a interesting case... So skipping the robust_head VMA would solve
> the the case were all the locks are shared which is an improvement over the
> current implementation.
>
> We have been trying to modify our reproducer to creates the case described here,
> but so far have been unsuccessful.
>>
>> Now vs. cleaning up the robust list from the oom reaper context. That
>> should be doable with a lot of care, but the proposed patch is not even
>> close to a solution. It's simply broken.
>>
>>> -static void futex_cleanup_begin(struct task_struct *tsk)
>>> +static bool futex_cleanup_begin(struct task_struct *tsk, bool try)
>>> {
>>> /*
>>> * Prevent various race issues against a concurrent incoming waiter
>>> @@ -1055,7 +1056,12 @@ static void futex_cleanup_begin(struct task_struct *tsk)
>>> * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
>>> * attach_to_pi_owner().
>>> */
>>> - mutex_lock(&tsk->futex_exit_mutex);
>>> + if (try) {
>>> + if (!mutex_trylock(&tsk->futex_exit_mutex))
>>> + return false;
>>> + } else {
>>> + mutex_lock(&tsk->futex_exit_mutex);
>>> + }
>>
>> That conditional locking is disgusting.
>>
>>> void futex_exit_release(struct task_struct *tsk)
>>> {
>>> - futex_cleanup_begin(tsk);
>>> + futex_cleanup_begin(tsk, false);
>>
>> If the task already cleaned up the robust list then this will roll back
>> tsk->futex_state from FUTEX_STATE_DEAD to FUTEX_STATE_EXITING. Sigh...
>>
>>> + futex_cleanup(tsk);
>>> + futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
>>> +}
>>> +
>>> +/* Try to perform the futex_cleanup and return true if successful.
>>
>> This is not a proper multi line comment.
>>
>> /*
>> * Multi line comments look like this:
>> *
>> * Properly formatted.
>> *
>> * Don't try to use the network comment style
>> * on anything outside of networking.
>> */
>>
>>> + * Designed to be called from the context of the OOM Reaper.
>>
>> Let's talk about design later.
>>
>>> + */
>>> +bool try_futex_exit_release(struct task_struct *tsk)
>>> +{
>>> + if (!futex_cleanup_begin(tsk, true))
>>> + return false;
>>> +
>>> + /* We are calling this from the context of a kthread. We need to
>>> + * instruct the kthread to use the address space of the given mm
>>> + * so the get_user won't return -EFAULT.
>>
>> How is this preventing get_user() or any other operation on the tasks
>> user memory to return -EFAULT? Not at all. Any user access can fail and
> Without the kthread_use_mm the kthread cannot instrument on the memory and the
> get_users in futex_cleanup is guaranteed to fail... I left that comment to avoid
> confusion.
>> return -EFAULT. Comments are there to explain things not to create
>> confusion.
>>
>>> + */
>>> + kthread_use_mm(tsk->mm);
>>> futex_cleanup(tsk);
>>
>> But aside of that. How is this supposed to work correctly?
>>
>> oom_reaper()
>> oom_reap_task()
>> oom_reap_task_mm()
>> mmap_read_trylock(mm) <- Succeeds
>> try_futex_exit_release()
>> use_mm()
>> futex_cleanup()
>> get_user() -> #PF
>>
>> #PF
>> if (!mmap_read_trylock(mm)) {
>>
>> So here the problem starts. The trylock can succeed or not, depending
>> on the contention state of mm::mmap_lock.
>>
>> So in case the trylock fails because there is a writer waiting, then it
>> runs into this:
>>
>> if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
>> ....
>> return;
>> }
>>
>> This condition evaluates to false because get_user() has an
>> exception table entry. So this proceeds and does:
>>
>> mmap_read_lock(mm);
>>
>> which is a full dead lock.
>>
>> But even if the trylock succeeds then this runs into the full fault
>> path, which is not correct either for pretty obvious reasons.
>>
>> I assume that's all part of the design, right?
>
> Yeah all of this makes the solution pretty useless in its current state. Thanks
> for pointing that out in detail.
>
>> But the real questions here are:
>>
>> Why are we doing this remote reaping at all?
>>
>> What is the condition that a task which is killed with a fatal signal
>> does not reach do_exit() and cleans up itself?
>>
>> If the answer is "because", then we should rather make sure that this
>> gets fixed.
>>
>> If there is a legitimate reason why a task cannot handle a fatal signal,
>> then yes the oom reaper might be necessary, but unleashing the oom
>> reaper unconditionally is simply a bad idea and results in the problem
>> which this is trying to paper over.
>>
>> The oom reaper should be the last resort IMO and not racing against the
>> killed task in the first place. IOW, give the task some time to clean
>> itself up and if that fails and it is truly stuck and unable to do so,
>> then reap the mm. But that should be the rare case and then the stuck
>> futex should be the least of our worries.
>>
>> Thanks,
>>
>> tglx
>
> Thanks for the review!
>
> Given our inability to reproduce the tls -> private mutex -> shared mutex case
> we are going to continue forward with the VMA skipping, as that should at least
> clear up the cases where all the locks are shared.
>
> Cheers,
> -- Nico
>

Next message: kernel test robot: "[ammarfaizi2-block:robh/linux/dt/pop-pci-nodes 2/2] drivers/pci/of.c:84: undefined reference to `.of_attach_node'"
Previous message: patchwork-bot+netdevbpf: "Re: [PATCH bpf-next] bpf/selftests: use bpf_num_possible_cpus() in per-cpu map allocations"
In reply to: Nico Pache: "Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]