Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

From: KAMEZAWA Hiroyuki
Date: Wed Mar 23 2011 - 05:13:22 EST


On Fri, 18 Mar 2011 00:57:09 -0700
Greg Thelen <gthelen@xxxxxxxxxx> wrote:

> On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> >> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
> >> Vivek have had some discussion around how memcg and writeback mesh.  In my
> >> mind, the discussions in 8/9 are starting to blend with this thread.
> >>
> >> I have been thinking about Johannes' struct memcg_mapping.  I think the idea
> >> may address several of the issues being discussed, especially the interaction
> >> between IO-less balance_dirty_pages() and memcg writeback.
> >>
> >> Here is my thinking.  Feedback is most welcome!
> >>
> >> The data structures:
> >> - struct memcg_mapping {
> >>         struct address_space *mapping;
> >>         struct mem_cgroup *memcg;
> >>         int refcnt;
> >>   };
> >> - each memcg contains a (radix tree, hash table, etc.) mapping from bdi to
> >>   memcg_bdi.
> >> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This may be
> >>   a very large set representing many cached inodes.
> >> - each memcg_mapping represents all pages within a (bdi, inode, memcg) tuple.
> >>   All corresponding cached inode pages point to the same memcg_mapping via
> >>   pc->mapping.  I assume that all pages of an inode belong to no more than
> >>   one bdi.
> >> - manage a global list of memcgs that are over their respective background
> >>   dirty limit.
> >> - i_mapping continues to point to a traditional non-memcg mapping (no change
> >>   here).
> >> - none of these memcg_* structures affect the root cgroup or kernels with
> >>   memcg configured out.
> >
> > So structures roughly like this:
> >
> > struct mem_cgroup {
> >         ...
> >         /* key is struct backing_dev_info * */
> >         struct rb_root memcg_bdis;
> > };
> >
> > struct memcg_bdi {
> >         /* key is struct address_space * */
> >         struct rb_root memcg_mappings;
> >         struct rb_node node;
> > };
> >
> > struct memcg_mapping {
> >         struct address_space *mapping;
> >         struct mem_cgroup *memcg;
> >         struct rb_node node;
> >         atomic_t count;
> > };
> >
> > struct page_cgroup {
> >         ...
> >         struct memcg_mapping *memcg_mapping;
> > };
> >
> >> The routines under discussion:
> >> - memcg charging a new inode page to a memcg: will use inode->mapping and
> >>   inode to walk memcg -> memcg_bdi -> memcg_mapping, lazily allocating
> >>   missing levels in the data structure.
> >>
> >> - Uncharging an inode page from a memcg: will use pc->mapping->memcg to
> >>   locate the memcg.  If the refcnt drops to zero, then remove the
> >>   memcg_mapping from the memcg_bdi.  Also delete the memcg_bdi if the last
> >>   memcg_mapping is removed.
> >>
> >> - account_page_dirtied(): nothing new here, continue to set the per-page
> >>   flags and increment the memcg per-cpu dirty page counter.  Same goes for
> >>   routines that mark pages in writeback and clean states.
> >
> > We may want to remember the dirty memcg_mappings so that on writeback
> > we don't have to go through every single one that the memcg refers to?
>
> I think this is a good idea: keep a per-memcg, per-bdi list of dirty mappings.
>
> It feels like some of this is starting to gel. I've been sketching
> some of the code to see how the memcg locking will work out. The
> basic structures I see are:
>
> struct mem_cgroup {
>         ...
>         /*
>          * For all file pages cached by this memcg, sorted by bdi.
>          * key is struct backing_dev_info *; value is struct memcg_bdi *
>          * Protected by bdis_lock.
>          */
>         struct rb_root bdis;
>         spinlock_t bdis_lock;   /* or use an rcu structure; the memcg:bdi
>                                  * set could be fairly static */
> };
>
> struct memcg_bdi {
>         struct backing_dev_info *bdi;
>         /*
>          * key is struct address_space *; value is struct memcg_mapping *
>          * memcg_mappings live within either the mappings or the
>          * dirty_mappings set.
>          */
>         struct rb_root mappings;
>         struct rb_root dirty_mappings;
>         struct rb_node node;
>         spinlock_t lock;        /* protects [dirty_]mappings */
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct memcg_bdi *memcg_bdi;
>         struct rb_node node;
>         atomic_t nr_pages;
>         atomic_t nr_dirty;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
> - each memcg contains a mapping from bdi to memcg_bdi.
> - each memcg_bdi contains two mappings:
>     mappings: from address_space to memcg_mapping for clean pages
>     dirty_mappings: from address_space to memcg_mapping when there are some
>       dirty pages
> - each memcg_mapping represents a set of cached pages within a
>   (bdi, inode, memcg) tuple.  All corresponding cached inode pages point to
>   the same memcg_mapping via pc->mapping.  I assume that all pages of an
>   inode belong to no more than one bdi.
> - manage a global list of memcgs that are over their respective background
>   dirty limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
>   here).
> - none of these memcg_* structures affect the root cgroup or kernels with
>   memcg configured out.
>
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and
>   inode to walk memcg -> memcg_bdi -> mappings, lazily allocating missing
>   levels in the data structure.
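
A minimal sketch of that two-level walk, assuming hypothetical helpers
memcg_bdi_find_or_create() and memcg_mapping_find_or_create() that allocate
and insert an rb-tree node when the key is absent (error handling elided):

static struct memcg_mapping *
memcg_charge_inode_page(struct mem_cgroup *memcg, struct page *page)
{
        struct address_space *mapping = page->mapping;
        struct backing_dev_info *bdi = mapping->backing_dev_info;
        struct memcg_bdi *mb;
        struct memcg_mapping *mm;

        /* level 1: find or lazily allocate the per-bdi node */
        spin_lock(&memcg->bdis_lock);
        mb = memcg_bdi_find_or_create(memcg, bdi);
        spin_unlock(&memcg->bdis_lock);

        /* level 2: find or lazily allocate the per-inode node */
        spin_lock(&mb->lock);
        mm = memcg_mapping_find_or_create(mb, mapping);
        atomic_inc(&mm->nr_pages);
        spin_unlock(&mb->lock);

        return mm;      /* caller stores this in pc->memcg_mapping */
}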
>
> - Uncharging an inode page from a memcg: will use pc->memcg_mapping to locate
>   the memcg_bdi (and thus the memcg).  If nr_pages drops to zero, then remove
>   the memcg_mapping from memcg_bdi.[dirty_]mappings.  Also delete the
>   memcg_bdi if its last memcg_mapping is removed.
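
The matching teardown might look like this, assuming a hypothetical
memcg_mapping_tree() that selects mappings vs. dirty_mappings by nr_dirty,
and a hypothetical memcg_bdi_put() that erases an empty memcg_bdi from
memcg->bdis:

static void memcg_uncharge_inode_page(struct page_cgroup *pc)
{
        struct memcg_mapping *mm = pc->memcg_mapping;
        struct memcg_bdi *mb = mm->memcg_bdi;

        pc->memcg_mapping = NULL;
        if (!atomic_dec_and_test(&mm->nr_pages))
                return;

        spin_lock(&mb->lock);
        /* recheck under the lock: a concurrent charge may have
         * taken a new reference on this memcg_mapping */
        if (atomic_read(&mm->nr_pages) == 0) {
                rb_erase(&mm->node, memcg_mapping_tree(mb, mm));
                kfree(mm);
        }
        spin_unlock(&mb->lock);

        memcg_bdi_put(mb);      /* delete the memcg_bdi if now empty */
}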
>
> - account_page_dirtied(): increment nr_dirty.  If this is the first dirty
>   page, then move the memcg_mapping from memcg_bdi.mappings to
>   memcg_bdi.dirty_mappings.  When marking a page clean, do the opposite.
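
In code, the first-dirty-page transition could be roughly as below; the
nr_dirty check races with concurrent cleaners, so a real implementation
would recheck under mb->lock (memcg_mapping_insert() is hypothetical):

static void memcg_mapping_set_dirty(struct memcg_mapping *mm)
{
        struct memcg_bdi *mb = mm->memcg_bdi;

        if (atomic_inc_return(&mm->nr_dirty) == 1) {
                /* first dirty page: move this inode from the
                 * clean tree to the dirty tree */
                spin_lock(&mb->lock);
                rb_erase(&mm->node, &mb->mappings);
                memcg_mapping_insert(&mb->dirty_mappings, mm);
                spin_unlock(&mb->lock);
        }
}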
>
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
>   background limit, then add the memcg to the global memcg_over_bg_limit list
>   and use the memcg's set of memcg_bdi to wake up each(?) corresponding bdi
>   flusher.  If over the fg limit, then use IO-less style foreground throttling
>   with the per-memcg, per-bdi (aka memcg_bdi) accounting structure.
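
Roughly, with all memcg_* threshold/list helpers and the memcg_bdi iterator
hypothetical (bdi_start_background_writeback() is the existing flusher
wakeup):

void mem_cgroup_balance_dirty_pages(struct mem_cgroup *memcg)
{
        struct memcg_bdi *mb;

        if (!memcg_over_bg_dirty_thresh(memcg))
                return;

        /* background limit exceeded: queue this memcg and kick
         * the flusher thread of each bdi it has dirtied */
        memcg_over_bg_limit_list_add(memcg);
        for_each_memcg_bdi(memcg, mb)
                bdi_start_background_writeback(mb->bdi);

        /* foreground limit: IO-less throttling against the
         * per-memcg, per-bdi (memcg_bdi) accounting */
        if (memcg_over_fg_dirty_thresh(memcg))
                memcg_wait_for_bdi_writeback(memcg);
}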
>
> - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
>   fs-writeback.c so that wb_do_writeback() will return to calling
>   wb_check_background_flush() to check background limits, remaining
>   interruptible if a sync flush occurs.  wb_check_background_flush() will
>   check the global memcg_over_bg_limit list for memcgs that are over their
>   dirty limit.  Within each memcg, write inodes from the dirty_mappings list
>   until a threshold page count has been reached (MAX_WRITEBACK_PAGES), then
>   move to the next listed memcg.
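
So the flusher-side loop might look something like this; bg_node is an
assumed list_head in struct mem_cgroup linking it into the global list, the
lookup/writeback helpers are hypothetical, and locking of the global list
is omitted:

static long wb_check_memcg_background_flush(struct bdi_writeback *wb)
{
        long written = 0;
        struct mem_cgroup *memcg;

        list_for_each_entry(memcg, &memcg_over_bg_limit_list, bg_node) {
                struct memcg_bdi *mb;

                mb = memcg_bdi_lookup(memcg, wb->bdi);
                if (!mb)
                        continue;
                /* write inodes from mb->dirty_mappings until the
                 * chunk threshold, then move to the next memcg */
                written += memcg_bdi_writeback_inodes(mb,
                                                      MAX_WRITEBACK_PAGES);
        }
        return written;
}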
>
> - over_bground_thresh() will determine if the memcg is still over the bg
>   limit.  If over the limit, then per-bdi, per-memcg background flushing will
>   continue.  If not over the limit, then the memcg will be removed from the
>   memcg_over_bg_limit list.
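
Something like (both the dirty-page counter and threshold helpers here are
hypothetical):

static bool memcg_over_bground_thresh(struct mem_cgroup *memcg)
{
        if (mem_cgroup_dirty_pages(memcg) >
            mem_cgroup_bg_dirty_thresh(memcg))
                return true;    /* keep background flushing this memcg */

        /* dropped below the limit: stop flushing on its behalf */
        memcg_over_bg_limit_list_del(memcg);
        return false;
}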
>
> I'll post my resulting patches in RFC form, or (at the least) my conclusions.
>
please take care of force_empty and move_mapping et al. when you do this,
and please do rmdir() tests.


Thanks,
-Kame
