Re: [RFC PATCH 2/2] memcg: do not report racy no-eligible OOM tasks

From: Tetsuo Handa
Date: Tue Oct 23 2018 - 08:34:07 EST


On 2018/10/23 21:10, Michal Hocko wrote:
> On Tue 23-10-18 13:42:46, Michal Hocko wrote:
>> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>> index e79cb59552d9..a9dfed29967b 100644
>>>>>> --- a/mm/memcontrol.c
>>>>>> +++ b/mm/memcontrol.c
>>>>>> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>>>> .gfp_mask = gfp_mask,
>>>>>> .order = order,
>>>>>> };
>>>>>> - bool ret;
>>>>>> + bool ret = true;
>>>>>>
>>>>>> mutex_lock(&oom_lock);
>>>>>> +
>>>>>> + /*
>>>>>> + * multi-threaded tasks might race with oom_reaper and gain
>>>>>> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
>>>>>> + * to out_of_memory failure if the task is the last one in
>>>>>> + * memcg which would be a false positive failure reported
>>>>>> + */
>>>>>> + if (tsk_is_oom_victim(current))
>>>>>> + goto unlock;
>>>>>> +
>>>>>
>>>>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>>>>> so that any killed threads no longer wait for oom_lock.
>>>>
>>>> tsk_is_oom_victim is stronger because it doesn't depend on
>>>> fatal_signal_pending which might be cleared throughout the exit process.
>>>>
>>>
>>> I still want to propose this. No need to be memcg OOM specific.
>>
>> Well, I maintain what I've said [1] about simplicity and specific fix
>> for a specific issue. Especially in the tricky code like this where all
>> the consequences are far more subtle than they seem to be.
>>
>> This is obviously a matter of taste but I don't see much point discussing
>> this back and forth forever. If there is general agreement that
>> the above is less appropriate, then I am willing to consider a different
>> change, but I simply do not have the energy to nitpick forever.
>>
>> [1] http://lkml.kernel.org/r/20181022134315.GF18839@xxxxxxxxxxxxxx
>
> In other words. Having a memcg specific fix means, well, a memcg
> maintenance burden. Like any other memcg specific oom decisions we
> already have. So are you OK with that Johannes or you would like to see
> a more generic fix which might turn out to be more complex?
>

I don't know what "that Johannes" refers to.

If you don't want to affect SysRq-OOM and pagefault-OOM cases,
are you OK with having a global-OOM specific fix?

mm/page_alloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c1..f59f029 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3518,6 +3518,17 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (gfp_mask & __GFP_THISNODE)
 		goto out;

+	/*
+	 * It is possible that multi-threaded OOM victims get
+	 * task_will_free_mem(current) == false when the OOM reaper quickly
+	 * sets MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+	 * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+	 * request), we don't need to select the next OOM victim.
+	 */
+	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL)) {
+		*did_some_progress = 1;
+		goto out;
+	}
 	/* Exhausted what can be done so it's blame time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
--
1.8.3.1
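
For readers who want to see the decision in isolation, here is a minimal userspace sketch of the check proposed in the hunk above. It is not kernel code. struct model_task, MODEL_GFP_NOFAIL, enum model_action and model_may_oom() are all hypothetical names invented for this illustration, and the flag value is arbitrary. The only thing it models is the condition itself: an allocating task that is already an OOM victim, and whose request is not __GFP_NOFAIL, reports progress and skips victim selection, while every other case falls through to out_of_memory().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value for this model only; not the kernel's definition. */
#define MODEL_GFP_NOFAIL 0x1u

/* Hypothetical stand-in for the state the real check reads from current. */
struct model_task {
	bool is_oom_victim;	/* models tsk_is_oom_victim(current) */
};

enum model_action {
	MODEL_SKIP_OOM_KILLER,	/* models "*did_some_progress = 1; goto out;" */
	MODEL_CALL_OOM_KILLER,	/* models falling through to out_of_memory() */
};

/* Mirrors only the condition added by the patch, nothing else. */
static enum model_action model_may_oom(const struct model_task *tsk,
				       unsigned int gfp_mask)
{
	if (tsk->is_oom_victim && !(gfp_mask & MODEL_GFP_NOFAIL))
		return MODEL_SKIP_OOM_KILLER;
	return MODEL_CALL_OOM_KILLER;
}

/* Walks the three interesting cases; aborts if the model disagrees. */
static void model_self_check(void)
{
	struct model_task victim = { .is_oom_victim = true };
	struct model_task bystander = { .is_oom_victim = false };

	/* A victim with a normal request does not select another victim. */
	assert(model_may_oom(&victim, 0) == MODEL_SKIP_OOM_KILLER);
	/* A victim with a NOFAIL request still goes to the OOM killer. */
	assert(model_may_oom(&victim, MODEL_GFP_NOFAIL) == MODEL_CALL_OOM_KILLER);
	/* A task that is not a victim is unaffected by the new check. */
	assert(model_may_oom(&bystander, 0) == MODEL_CALL_OOM_KILLER);
}
```

Calling model_self_check() from a small main() exercises all three cases. The sketch deliberately leaves out oom_lock, the OOM reaper and MMF_OOM_SKIP, so it says nothing about the race itself, only about which callers the new condition would divert.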