Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim

From: Michal Hocko
Date: Tue Jan 21 2014 - 03:35:16 EST


On Mon 20-01-14 21:16:36, Hugh Dickins wrote:
> On Fri, 17 Jan 2014, Michal Hocko wrote:
> > On Thu 16-01-14 11:15:36, Hugh Dickins wrote:
> >
> > > I don't believe 19f39402864e was responsible for a reference leak,
> > > that came later. But I think it was responsible for the original
> > > endless iteration (shrink_zone going around and around getting root
> > > again and again from mem_cgroup_iter).
> >
> > So your hang is not within mem_cgroup_iter but you are getting root all
> > the time without any way out?
>
> In the 3.10 and 3.11 cases, yes.

OK, that makes sense.

> > [3.10 code base]
> > shrink_zone
> > [rmdir root]
> > mem_cgroup_iter(root, NULL, reclaim)
> > // prev = NULL
> > rcu_read_lock()
> > last_visited = iter->last_visited // gets root || NULL
> > css_tryget(last_visited) // failed
> > last_visited = NULL [1]
> > memcg = root = __mem_cgroup_iter_next(root, NULL)
> > iter->last_visited = root;
> > reclaim->generation = iter->generation
> >
> > mem_cgroup_iter(root, root, reclaim)
> > // prev = root
> > rcu_read_lock
> > last_visited = iter->last_visited // gets root
> > css_tryget(last_visited) // failed
> > [1]
> >
> > So we indeed can loop here without any progress. I just fail
> > to see how my patch could help. We even do not get down to
> > cgroup_next_descendant_pre.
> >
> > Or am I missing something?
>
> Your patch to 3.12 and 3.13 mem_cgroup_iter_next() doesn't help
> in 3.10 and 3.11, correct. That's why I appended a different patch,
> to mem_cgroup_iter(), for the 3.10 and 3.11 versions of the hang.
>
> >
> > The following should fix this kind of endless loop:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 194721839cf5..168e5abcca92 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1221,7 +1221,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > smp_rmb();
> > last_visited = iter->last_visited;
> > if (last_visited &&
> > - !css_tryget(&last_visited->css))
> > + last_visited != root &&
> > + !css_tryget(&last_visited->css))
> > last_visited = NULL;
> > }
> > }
> > @@ -1229,7 +1230,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > memcg = __mem_cgroup_iter_next(root, last_visited);
> >
> > if (reclaim) {
> > - if (last_visited)
> > + if (last_visited && last_visited != root)
> > css_put(&last_visited->css);
> >
> > iter->last_visited = memcg;
>
> Right, that appears to fix 3.10, and seems a better alternative to the
> patch I suggested. I say "appears" because my success in reproducing
> the hang is variable, so when I see that it's "fixed" I cannot be
> quite sure.

Understood.

> I say "seems" because I think yours respects the intention
> of the iterator better than mine, but I've never been convinced that
> the iterator is as sensible as it intends in the face of races.
>
> At the bottom I've appended the version of yours that I've been trying
> on 3.11. I did succeed in reproducing the hang twice on 3.11.10.3
> (which I don't think differs in any essential from 3.11.0 for this issue,
> but after my lack of success with 3.11.0 I tried harder with that.)

git log points only at 3 patches in mm/memcontrol.c and all of them seem
unrelated.

> More so than in the 3.10 case, I haven't really given it long enough
> with the patch to really assert that it's good; and Greg Thelen came
> across a different reproduction case that I've yet to remind myself
> of and try, I'll have to report back to you later in the week when
> I've run that with your fix.

Great, thanks a lot for your testing. It is really appreciated
especially now that I am quite busy with other internal stuff.

> > Not that I like it much :/
>
> Well, I'm not in love with it, but I do think it's more appropriate
> than mine, if it really does fix the issues.

It fixes a potential endless loop. It is a question it is the one you
are seeing.

> It was only under questioning from you that we arrived at the belief
> that the problem is with the css_tryget of a root being removed: my
> patch was vaguer than that, not identifying the root cause.
>
> I suspect that the underlying problem is actually the "do {} while ()"
> nature of the iteration loops, instead of "while () {}"s.

I think the outside caller shouldn't care much. The iterator code has to
make sure that it doesn't loop itself. Doing while () {} has some issues
as well. Having a reason to reclaim but hen do not reclaim anything
might pop out as an issue upper in the calling stack.

> That places us (not for the first time) in the awkward position of
> having to supply something once (and once only) even when it doesn't
> really fit.
>
> (I have wondered whether making mem_cgroup_invalidate_reclaim_iterators
> visit the memcg as well as its parents, might provide another fix; nice
> if it did, but I doubt it, and have spent so much time fiddling around
> here that I've lost the will to try anything else.)

I do not see it as an easier alternative.

[...]
> > > > Cc: stable@xxxxxxxxxxxxxxx # 3.10+
> > >
> > > Well, I'm okay with that, if we use that as a way to shoehorn in the
> > > patch at the bottom instead for 3.10 and 3.11 stables.
> >
> > So far I do not see how it would make a change for those two kernels as
> > they have the special handling for root.
>
> That was my point: that patch does not fix 3.10 and 3.11 at all,
> but they suffer from the same problem (manifesting in a slightly
> different way, the hang revisiting mem_cgroup_iter repeatedly instead
> of being trapped inside it); so it may not be inappropriate to say 3.10+
> even though that particular patch will not apply and would not fix them.

OK, understood now. I will repost that patch with updated changelog
later.

> > [...]
> > > "Equivalent" patch for 3.10 or 3.11: fixing similar hangs but no leakage.
> > >
> > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > >
> > > --- v3.10/mm/memcontrol.c 2013-06-30 15:13:29.000000000 -0700
> > > +++ linux/mm/memcontrol.c 2014-01-15 18:18:24.476566659 -0800
> > > @@ -1226,7 +1226,8 @@ struct mem_cgroup *mem_cgroup_iter(struc
> > > }
> > > }
> > >
> > > - memcg = __mem_cgroup_iter_next(root, last_visited);
> > > + if (!prev || last_visited)
> > > + memcg = __mem_cgroup_iter_next(root, last_visited);
> >
> > I am confused. What would change between those two calls to change the
> > outcome? The function doesn't have any internal state.
>
> I don't understand your question (what two calls?).

OK, it was my selective blindness that stroke again here. Sorry about
the confusion.

With fresh eyes. Yes it would work as well.

> The 3.10 or 3.11
> __mem_cgroup_iter_next() begins with "if (!last_visited) return root;",
> which was problematic because again and again it would return root.
> Originally I passed in prev, and returned NULL instead of root if prev
> but !last_visited; but I've an aversion to passing a function an extra
> argument to say it shouldn't have been called, so in this version I'm
> testing !prev || last_visited before calling it. Perhaps your "two
> calls" are the first with prev == NULL and the second with prev == root.
>
> But I say I prefer your fix because mine above says nothing about root,
> which we now believe is the only problematic case. Mine would leave
> memcg NULL whenever a change resets last_visited to NULL (once one memcg
> has been delivered): which is simple, but not what the iterator intends
> (if I read it right, it wants to start again from the beginning, whereas
> I'm hastening it to the end). In practice mine works well, and I haven't
> seen the premature OOMs that you might suppose it leads to; but let's go
> for yours as more in keeping with the spirit of the iterator.

OK, let's keep it consistently ugly.

> "The spirit of the iterator", now that's a fine phrase.

:)

> Here's my 3.11 version of your 3.10, in case you spot something silly.
> I'll give it a try on Greg's testcase in coming days and report back.
> (Greg did suggest a different fix from mine back when he hit the issue,
> I'll also look that one out again in case it offers something better.)
>
> --- v3.11/mm/memcontrol.c 2014-01-19 14:16:38.656701990 -0800
> +++ linux/mm/memcontrol.c 2014-01-20 19:04:50.635637615 -0800
> @@ -1148,19 +1148,17 @@ mem_cgroup_iter_load(struct mem_cgroup_r
> if (iter->last_dead_count == *sequence) {
> smp_rmb();
> position = iter->last_visited;
> - if (position && !css_tryget(&position->css))
> + if (position && position != root &&
> + !css_tryget(&position->css))
> position = NULL;
> }
> return position;
> }
>
> static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
> - struct mem_cgroup *last_visited,
> struct mem_cgroup *new_position,
> int sequence)
> {
> - if (last_visited)
> - css_put(&last_visited->css);
> /*
> * We store the sequence count from the time @last_visited was
> * loaded successfully instead of rereading it here so that we
> @@ -1234,7 +1232,10 @@ struct mem_cgroup *mem_cgroup_iter(struc
> memcg = __mem_cgroup_iter_next(root, last_visited);
>
> if (reclaim) {
> - mem_cgroup_iter_update(iter, last_visited, memcg, seq);
> + if (last_visited && last_visited != root)
> + css_put(&last_visited->css);
> +
> + mem_cgroup_iter_update(iter, memcg, seq);
>
> if (!memcg)
> iter->generation++;

Yes it looks good. Although I would probably go and add root into
mem_cgroup_iter_update and do the check and css_put there to have
it symmetric with mem_cgroup_iter_load. I will cook up a changelog for
this one as well (for both 3.10 and 3.11 because they share fail on root
case).

Thanks a lot!
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/