Re: [patch 9/9] mm: keep page cache radix tree nodes in check

From: Dave Chinner
Date: Mon Jan 20 2014 - 22:04:53 EST


On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > + /* Only shadow entries in there, keep track of this node */
> > > + if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > + list_empty(&node->private_list)) {
> > > + node->private_data = mapping;
> > > + list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > + }
> >
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate is the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to do conditionally update
> > state iff the object was added to the LRU is:
> >
> > if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > node->private_data = mapping;
> > }
> >
> > > + radix_tree_replace_slot(slot, page);
> > > + mapping->nrpages++;
> > > + if (node) {
> > > + node->count++;
> > > + /* Installed page, can't be shadow-only anymore */
> > > + if (!list_empty(&node->private_list))
> > > + list_lru_del(&workingset_shadow_nodes,
> > > + &node->private_list);
> > > + }
> >
> > Same issue here:
> >
> > if (node) {
> > node->count++;
> > list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > }
>
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.

Can you please add that as a comment somewhere explaining why it is
safe to do this?

> > > + case LRU_REMOVED_RETRY:
> > > if (--nlru->nr_items == 0)
> > > node_clear(nid, lru->active_nodes);
> > > WARN_ON_ONCE(nlru->nr_items < 0);
> > > isolated++;
> > > + /*
> > > + * If the lru lock has been dropped, our list
> > > + * traversal is now invalid and so we have to
> > > + * restart from scratch.
> > > + */
> > > + if (ret == LRU_REMOVED_RETRY)
> > > + goto restart;
> > > break;
> > > case LRU_ROTATE:
> > > list_move_tail(item, &nlru->list);
> >
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> >
> > case LRU_REMOVED_RETRY:
> > assert_spin_locked(&nlru->lock);
> > case LRU_REMOVED:
>
> Ah, good idea. How about adding it to LRU_RETRY as well?

Yup, good idea.

> > > +static struct shrinker workingset_shadow_shrinker = {
> > > + .count_objects = count_shadow_nodes,
> > > + .scan_objects = scan_shadow_nodes,
> > > + .seeks = DEFAULT_SEEKS * 4,
> > > + .flags = SHRINKER_NUMA_AWARE,
> > > +};
> >
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weighings/importance
> > we give to slab reclaim so we can determine if it's actually
> > acheiving the desired balance under different loads...
>
> This is not an exact science, to say the least.

I know, that's why I asked it be documented rather than be something
kept in your head.

> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
>
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
>
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
>
> "Seeks" currently encodes 3 ratios:
>
> 1. the cost of creating an object vs. a page
>
> 2. the expected number of objects vs. pages

It doesn't encode that at all. If it did, then the default value
wouldn't be "2".

> 3. the cost of reclaiming an object vs. a page

Which, when you consider #3 in conjunction with #1, the actual
intended meaning of .seeks is "the cost of replacing this object in
the cache compared to the cost of replacing a page cache page."

> but they are not necessarily correlated. How I would like to
> configure the shadow shrinker instead is:
>
> o scan objects when reclaim efficiency is down to 75%, because they
> are more valuable than use-once cache but less than workingset
>
> o scan objects when the ratio between them and the number of pages
> exceeds 1/32 (one shadow entry for each resident page, up to 64
> entries per shrinkable object, assume 50% packing for robustness)
>
> o as the expected balance between objects and lru pages is 1:32,
> reclaim one object for every 32 reclaimed LRU pages, instead of
> assuming that number of scanned pages corresponds meaningfully to
> number of objects to scan.

You're assuming that every radix tree node has a full population of
pages. This only occurs on sequential read and write workloads, and
so isn't going tobe true for things like mapped executables or any
semi-randomly accessed data set...

> "4" just doesn't have the same ring to it.

Right, but you still haven't explained how you came to the value of
"4"....

> It would be great if we could eliminate the reclaim cost assumption by
> turning the nr_to_scan into a nr_to_reclaim, and then set the other
> two ratios independently.

That doesn't work for caches that are full of objects that can't (or
won't) be reclaimed immediately. The CPU cost of repeatedly scanning
to find N reclaimable objects when you have millions of objects in
the cache is prohibitive.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/