Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

From: Hugh Dickins
Date: Sun Oct 07 2007 - 13:44:07 EST


On Fri, 5 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> >
> > That's where it should happen, yes; but my point is that it very
> > often does not. Because the swap cache page (read in as part of
> > the readaround cluster of some other cgroup, or in swapoff by some
> > other cgroup) is already assigned to that other cgroup (by the
> > mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
> > page_cgroup exists and the page has already been accounted" route
> > when mem_cgroup_charge is called from do_swap_page. Doesn't it?
> >
>
> You are right, at this point I am beginning to wonder if I should
> account for the swap cache at all? We account for the pages in RSS
> and when the page comes back into the page table(s) via do_swap_page.
> If we believe that the swap cache is transitional and the current
> expected working behaviour does not seem right or hard to fix,
> it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
> for accounting and control.

It would be wrong to ignore the unuse_pte() case: what it's intending
to do is correct, it's just being prevented by the swapcache issue
from doing what it intends at present.

(Though I'm not thrilled with the idea of it causing an admin's
swapoff to fail because of a cgroup reaching mem limit there, I do
agree with your earlier argument that that's the right thing to happen,
and it's up to the admin to fix things up - my original objection came
from not realizing that normally the cgroup will reclaim from itself
to free its mem. Hmm, would the charge fail or the mm get OOM'ed?)

Ignoring add_to/remove_from swap cache is what I've tried before,
and again today. It's not enough: if you trying run a memhog
(something that allocates and touches more memory than the cgroup
is allowed, relying on pushing out to swap to complete), then that
works well with the present accounting in add_to/remove_from swap
cache, but it OOMs once I remove the memcontrol mods from
mm/swap_state.c. I keep going back to investigate why, keep on
thinking I understand it, then later realize I don't. Please
give it a try, I hope you've got better mental models than I have.

And I don't think it will be enough to handle shmem/tmpfs either;
but won't worry about that until we've properly understood why
exempting swapcache leads to those OOMs, and fixed that up.

> > Are we misunderstanding each other, because I'm assuming
> > MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
> > though I can't see that _MAPPED and _CACHED are actually supported,
> > there being no reference to them outside the enum that defines them.
>
> I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
> discussion. The accounting is split into mem_cgroup_charge() and
> mem_cgroup_cache_charge(). While charging the caches is when we
> check for the control_type.

It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
(Or is it hidden in one of those preprocesor ## things which frustrate
both my greps and me!?)

> > Or are you deceived by that ifdef NUMA code in swapin_readahead,
> > which propagates the fantasy that swap allocation follows vma layout?
> > That nonsense has been around too long, I'll soon be sending a patch
> > to remove it.
>
> The swapin readahead code under #ifdef NUMA is very confusing.

I sent a patch to linux-mm last night, to remove that confusion.

> I also
> noticed another confusing thing during my test, swap cache does not
> drop to 0, even though I've disabled all swap using swapoff. May be
> those are tmpfs pages. The other interesting thing I tried was running
> swapoff after a cgroup went over it's limit, the swapoff succeeded,
> but I see strange numbers for free swap. I'll start another thread
> after investigating a bit more.

Those indeed are strange behaviours (if the swapoff really has
succeeded, rather than lying), I not seen such and don't have an
explanation. tmpfs doesn't add any weirdness there: when there's
no swap, there can be no swap cache. Or is the swapoff still in
progress? While it's busy, we keep /proc/meminfo looking sensible,
but <Alt><SysRq>m can show negative free swap (IIRC).

I'll be interested to hear what your investigation shows.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/