Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)
From: Hugh Dickins
Date: Wed Oct 03 2007 - 14:48:42 EST
On Wed, 3 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > Sorry, Balbir, I've failed to get back to you, still attending to
> > priorities. Let me briefly summarize my issue with the mem controller:
> > you've not yet given enough attention to swap.
> I am open to suggestions and ways and means of making swap control
> complete and more usable.
Well, swap control is another subject. I guess for that you'll need
to track which cgroup each swap page belongs to (rather more expensive
than the current swap_map of unsigned shorts). And I doubt it'll be
swap control as such that's required, but control of rss+swap.
But here I'm just worrying about how the existence of swap makes
something of a nonsense of your rss control.
> > I accept that full swap control is something you're intending to add
> > incrementally later; but the current state doesn't make sense to me.
> > The problems are swapoff and swapin readahead. These pull pages into
> > the swap cache, which are assigned to the cgroup (or the whatever-we-
> > call-the-remainder-outside-all-the-cgroups) which is running swapoff
I'd appreciate it if you'd teach me the right name for that!
> > or faulting in its own page; yet they very clearly don't (in general)
> > belong to that cgroup, but to other cgroups which will be discovered
> > later.
> I understand what your trying to say, but with several approaches that
> we tried in the past, we found caches the hardest to most accurately
> account. IIRC, with readahead, we don't even know if all the pages
> readahead will be used, that's why we charge everything to the cgroup
> that added the page to the cache.
Yes, readahead is anyway problematic. My guess is that in the file
cache case, you'll tend not to go too far wrong by charging to the
one that added - though we're all aware that's fairly unsatisfactory.
My point is that in the swap cache case, it's badly wrong: there's
no page more obviously owned by a cgroup than its anonymous pages
(forgetting for a moment that minority shared between cgroups
until copy-on-write), so it's very wrong for swapin readahead
or swapoff to go charging those to another or to no cgroup.
Imagine a cgroup at its rss limit, with more out on swap. Then
another cgroup does some swap readahead, bringing pages private
to the first into cache. Or runs swapoff which actually plugs
them into the rss of the first cgroup, so it goes over limit.
Those are pages we'd want to swap out when the first cgroup
faults to go further over its limit; but they're now not even
identified as belonging to the right cgroup, so won't be found.
> > I did try removing the cgroup mods to mm/swap_state.c, so swap pages
> > get assigned to a cgroup only once it's really known; but that's not
> > enough by itself, because cgroup RSS reclaim doesn't touch those
> > pages, so the cgroup can easily OOM much too soon. I was thinking
> > that you need a "limbo" cgroup for these pages, which can be attacked
> > for reclaim along with any cgroup being reclaimed, but from which
> > pages are readily migrated to their real cgroup once that's known.
> Is migrating the charge to the real cgroup really required?
My answer is definitely yes. I'm not suggesting that you need
general migration between cgroups at this stage (something for
later quite likely); but I am suggesting you need one pseudo-cgroup
to hold these cases temporarily, and that you cannot properly track
rss without it (if there is any swap).
> > But I had to switch over to other work before trying that out:
> > perhaps the idea doesn't really fly at all. And it might well
> > be no longer needed once full mem+swap control is there.
> > So in the current memory controller, that unuse_pte mem charge I was
> > originally worried about failing (I hadn't at that point delved in
> > to see how it tries to reclaim) actually never fails (and never
> > does anything): the page is already assigned to some cgroup-or-
> > whatever and is never charged to vma->vm_mm at that point.
Umm, please explain what's excellent about that.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/