Re: [PATCH] mm, memcg: reclaim more aggressively before high allocator throttling

From: Chris Down
Date: Thu May 21 2020 - 10:23:18 EST


Michal Hocko writes:
> On Thu 21-05-20 14:41:47, Chris Down wrote:
> > Michal Hocko writes:
> > > On Thu 21-05-20 13:57:59, Chris Down wrote:
> > [...]
> > > > If you're talking about reclaim, trying to reason about whether the overage
> > > > is the result of some other task in this cgroup or the task that's
> > > > allocating right now is something that we already know doesn't work well
> > > > (eg. global OOM).
> > >
> > > I am not sure I follow you here.
> >
> > Let me rephrase: your statement is that it's not desirable "that some task
> > would be throttled unexpectedly too long because of [the activity of another
> > task also within that cgroup]" (let me know if that's not what you meant).
> > But trying to avoid that requires knowing which activity abstractly
> > instigates this entire mess in the first place, which we have nowhere near
> > enough context to determine.
>
> Yeah, if we want to be really precise then you are right, nothing like
> that is really feasible for the reclaim. Reclaiming 1 page might be much
> more expensive than 100 pages because LRU order doesn't reflect the
> cost of the reclaim at all. What, I believe, we want is a best effort,
> really. If the reclaim target is somehow bound to the requested amount
> of memory then we can at least say that more memory hungry consumers are
> reclaiming more. Which is something people can wrap their head around
> much easier than a free competition on the reclaim with some hard to
> predict losers who do all the work and some lucky ones which just happen
> to avoid throttling by a better timing. Really think of the direct
> reclaim and how the unfairness suck there.

I really don't follow this logic. You're talking about reclaim-induced latency, but the alternative isn't freedom from latency, it's scheduler-induced latency from allocator throttling (and probably of a significantly higher magnitude). And again, that's totally justified if you are part of a cgroup which is significantly above its memory.high -- that's the kind of grouping you sign up for when you put multiple tasks in the same cgroup.

The premise of being over memory.high is that everyone in the affected cgroup must do their utmost to reclaim where possible, and if they fail to get below it again, we're going to deschedule them. *That's* what's best-effort about it.

The losers aren't hard to predict. It's *all* the tasks in this cgroup if they don't each make their utmost attempt to get the cgroup's memory back under control. Doing more reclaim isn't even in the same magnitude of sucking as getting allocator throttled.
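
To make the trade-off concrete, here is a toy model of the two positions above. It is only a sketch under stated assumptions, not the kernel's implementation: `Cgroup`, `reclaim`, `charge`, and the `max_rounds` knob are all hypothetical names, memory is counted in abstract pages, and "throttled" is reduced to a boolean. Each reclaim round is bound to the size of the request, as suggested in the quoted text; the only thing that varies is how many rounds are attempted before giving up and throttling.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    # All fields are toy quantities in abstract "pages" (assumption).
    usage: int        # pages currently charged to the cgroup
    high: int         # memory.high expressed in pages
    reclaimable: int  # pages that reclaim is able to free in this model


def reclaim(cg: Cgroup, target: int) -> int:
    """Free up to `target` pages, limited by what is reclaimable."""
    freed = min(target, cg.reclaimable)
    cg.usage -= freed
    cg.reclaimable -= freed
    return freed


def charge(cg: Cgroup, nr_pages: int, max_rounds: int = 1) -> bool:
    """Charge `nr_pages`, reclaim while over `high`, and report throttling.

    Each round reclaims at most `nr_pages` (target bound to the request).
    Returns True if the cgroup is still over `high` afterwards, which in
    this model means the allocating task would be throttled.
    """
    cg.usage += nr_pages
    for _ in range(max_rounds):
        if cg.usage <= cg.high:
            break
        reclaim(cg, nr_pages)
    return cg.usage > cg.high


# A small allocator in a cgroup that is already well over its high limit.
single = Cgroup(usage=1100, high=1000, reclaimable=500)
print(charge(single, 10, max_rounds=1))   # one request-sized round

retry = Cgroup(usage=1100, high=1000, reclaimable=500)
print(charge(retry, 10, max_rounds=16))   # keep reclaiming before throttling
```

In this model, a single request-sized round leaves the cgroup over its limit even though plenty of memory is reclaimable, so the task is throttled anyway. Allowing further rounds brings usage back to the limit and avoids throttling, at the price of the extra reclaim work.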