Re: [PATCH 1/1] mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary

From: Eric W. Biederman
Date: Thu Aug 20 2020 - 11:10:44 EST


Oleg Nesterov <oleg@xxxxxxxxxx> writes:

> On 08/20, Oleg Nesterov wrote:
>>
>> On 08/20, Eric W. Biederman wrote:
>> >
>> > --- a/fs/exec.c
>> > +++ b/fs/exec.c
>> > @@ -1139,6 +1139,10 @@ static int exec_mmap(struct mm_struct *mm)
>> >          vmacache_flush(tsk);
>> >          task_unlock(tsk);
>> >          if (old_mm) {
>> > +                mm->oom_score_adj = old_mm->oom_score_adj;
>> > +                mm->oom_score_adj_min = old_mm->oom_score_adj_min;
>> > +                if (tsk->vfork_done)
>> > +                        mm->oom_score_adj = tsk->vfork_oom_score_adj;
>>
>> too late, ->vfork_done is NULL after mm_release().
>>
>> And this can race with __set_oom_adj(). Yes, the current code is racy too,
>> but this change adds another race, __set_oom_adj() could already observe
>> ->mm != NULL and update mm->oom_score_adj.
>  ^^^^^^^^^^^^
>
> I meant ->mm == new_mm.
>
> And another problem. Suppose we have
>
>         if (!vfork()) {
>                 change_oom_score();
>                 exec();
>         }
>
> the parent can be killed before the child execs, in this case vfork_oom_score_adj
> will be lost.

Yes.

Looking at include/uapi/linux/oom.h, valid oom_score_adj values only run
from OOM_SCORE_ADJ_MIN (-1000) to OOM_SCORE_ADJ_MAX (1000), which leaves
most of the range of a short unused. So it should be possible to
initialize vfork_oom_score_adj to -32768 aka SHRT_MIN, and use that as a
flag to tell whether it has been set.

Likewise for vfork_oom_score_adj_min if we need to duplicate that one as
well.


That deals with that entire class of races. We still have a race during
exec where ->vfork_done is cleared by mm_release() before ->mm becomes
new_mm. That is worth fixing, but it is an independent issue.

Eric