Re: [PATCH 0/3] OOM detection rework v4

From: Michal Hocko
Date: Thu Feb 04 2016 - 09:24:09 EST


On Thu 04-02-16 14:39:05, Michal Hocko wrote:
> On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not sure we can fix these pathological loads where we hit the
> > > higher order depletion and there is a chance that one of the thousands
> > > of tasks terminates in an unpredictable way which happens to race with
> > > the OOM killer.
> >
> > When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> > I think there were less than one hundred tasks in the system and only
> > a few tasks were running. Not a pathological load at all.
>
> But as the OOM report clearly stated there were no > order-1 pages
> available in that particular case. And that happened after the direct
> reclaim and compaction were already invoked.
>
> As I've mentioned in the referenced email, we can try to do multiple
> retries e.g. do not give up on the higher order requests until we hit
> the maximum number of retries but I consider it quite ugly to be honest.
> I think that a proper communication with compaction is a more
> appropriate way to go long term. E.g. I find it interesting that
> try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
> and treats it as any other high order request.
>
> Something like the following:

Here it is with a patch description. Please note I haven't tested this
yet, so this is more of an RFC than something I am really convinced
about. I can live with it because the number of retries is nicely
bounded, but it feels too hackish because it makes the decision rather
blindly. I will ask Vlastimil and Mel whether they see a reasonable way
to communicate the compaction state. But I guess this is something that
can come later. What do you think?
---