Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

From: Michal Hocko
Date: Sat Sep 19 2015 - 11:51:16 EST


On Sat 19-09-15 23:33:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This has been posted in various forms many times over past years. I
> > still do not think this is a right approach of dealing with the problem.
>
> I do not think "GFP_NOFS can fail" patch is a right approach because
> that patch easily causes messages like below.
>
> Buffer I/O error on dev sda1, logical block 34661831, lost async page write
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
> XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
> XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

These messages just tell you that the allocation keeps failing. Have
a look at the code: they are basically open-coded NOFAIL allocations.
They haven't been converted to actually tell the MM layer that they
cannot fail because Dave said they have a long-term plan to change this
code and implement different failure strategies.
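
For readers who do not have the code at hand, the "open-coded NOFAIL"
pattern can be sketched roughly as below. This is a user-space model
written for illustration only, not the XFS code: try_alloc(),
failures_left and alloc_nofail_opencoded() are hypothetical names, and
the "warn every 100th retry" interval is an assumption chosen for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator that may fail: it fails the
 * next `failures_left` calls and then falls through to malloc(). */
static int failures_left;

static void *try_alloc(size_t size)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* Open-coded "cannot fail" allocation: the caller loops until the
 * allocation succeeds instead of telling the allocator that failure
 * is not an option. Every 100th failed attempt emits a warning and
 * bumps *warnings (interval chosen for illustration). */
static void *alloc_nofail_opencoded(size_t size, int *warnings)
{
	int retries = 0;
	void *ptr;

	assert(warnings != NULL);
	*warnings = 0;

	for (;;) {
		ptr = try_alloc(size);
		if (ptr)
			return ptr;
		if (++retries % 100 == 0) {
			fprintf(stderr,
				"possible memory allocation deadlock (retries=%d)\n",
				retries);
			(*warnings)++;
		}
		/* A real implementation would back off here before retrying. */
	}
}
```

The point of the sketch is that the warning only reports repeated
failure; the loop itself never gives up, so the allocator has no way
of knowing the caller cannot tolerate failure.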

> Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.
>
> I believe choosing more OOM victims is the only way which can solve OOM stalls.

I am very well aware of your position and of all the attempts to tweak
different code paths to get your corner case to pass. I, however, care
more about the longer-term goals. And I believe that the page allocator
and the reclaim should strive to be less deadlock prone in the first
place. That includes more natural semantics; the non-failing default
semantic is really error-prone IMHO. We have been through this
discussion many times already and I've tried to explain that this is a
long-term goal to be reached in incremental steps.
I really hate to do "easy" things now just to feel better about a
particular case when they will come back to bite us a little later. And
from my own experience I can tell you that more non-deterministic OOM
behavior is a thing people complain about.

> > You can quickly deplete memory reserves this way without making further
> > progress (I am afraid you can even trigger this from userspace without
> > having big privileges) so even administrator will have no way to
> > intervene.
>
> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is invisible lock dependency.

Of course. This is a heuristic and as such it can never work in 100% of
situations. And it is not the first heuristic we have had for the OOM
killer. The last time this was all rewritten was because the OOM
killer was too unreliable/non-deterministic. Reports have decreased
considerably since then.

> Therefore, why not to give up
> "there should be only up to 1 TIF_MEMDIE task" rule?

This has been explained several times. There is no guarantee this would
help, and _your_ own use case shows how you can end up with such long
lock dependency chains that you can easily eat up the whole memory
reserves before you can make any progress.
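
The arithmetic behind that concern can be sketched with a toy model.
Everything here is an assumption made for illustration: the function
name chain_makes_progress(), the idea that every task in the chain
takes a fixed number of pages from the reserves, and the numbers used
with it. It is not how the allocator accounts for reserves.

```c
#include <stdbool.h>

/* Toy model: each task in a lock dependency chain is given access to
 * the memory reserves and consumes `pages_per_victim` pages. Returns
 * true only if every task in the chain, including the last one that
 * would finally release memory, gets its pages before the reserves
 * run out. */
static bool chain_makes_progress(long reserve_pages, long pages_per_victim,
				 int chain_length)
{
	for (int i = 0; i < chain_length; i++) {
		if (reserve_pages < pages_per_victim)
			return false;	/* reserves exhausted mid-chain */
		reserve_pages -= pages_per_victim;
	}
	return true;
}
```

With 1000 reserve pages and 100 pages per task, a chain of 10 still
gets through while a chain of 11 does not, and nothing bounds the
chain length up front.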

I do agree that a handbrake mechanism is really desirable for those who
really care.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
> and somehow manage in a way memory reserves will not deplete.

But those two go against each other.

[...]

> If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
> rule, what alternative do you have? (I do not like panic_on_oom_timeout
> because it is more data-lossy approach than choosing next OOM victim.)

I am not married to the 1 TIF_MEMDIE task thing. I just think that
there is still a lot of room for other improvements. The original issue
which triggered this discussion again is a good example. I completely
fail to see why a writer has to be unkillable when the fs is frozen.
There are others which are more complicated, of course, including the
whole class represented by GFP_NOFS allocations, as you have noted. But
we still have room for improvement even in the reclaim. It was
suggested quite some time ago that the memory mapped by the OOM victim
might be unmapped, which is basically what Oleg is proposing in another
email. I haven't gotten to read his email properly yet, but that should
certainly help to reduce the problem space.

--
Michal Hocko
SUSE Labs