Re: [patch 0/8] mm: thrash detection-based file cache sizing v5

From: Johannes Weiner
Date: Wed Oct 16 2013 - 18:32:05 EST


On Wed, Oct 16, 2013 at 01:26:06PM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2013 at 10:05:26PM -0400, Rik van Riel wrote:
> > On 10/15/2013 07:41 PM, Dave Chinner wrote:
> > > On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:
> >
> > >> I'm not forgetting about them, I just track them very coarsely by
> > >> linking up address spaces and then lazily enforce their upper limit
> > >> when memory is tight by using the shrinker callback. The assumption
> > >> was that actually scanning them is such a rare event that we trade the
> > >> rare computational costs for smaller memory consumption most of the
> > >> time.
> > >
> > > Sure, I understand the tradeoff that you made. But there's nothing
> > > worse than a system that slows down unpredictably because of some
> > > magic threshold in some subsystem has been crossed and
> > > computationally expensive operations kick in.
> >
> > The shadow shrinker should remove the radix nodes with
> > the oldest shadow entries first, so true LRU should actually
> > work for the radix tree nodes.
> >
> > Actually, since we only care about the age of the youngest
> > shadow entry in each radix tree node, FIFO will be the same
> > as LRU for that list.
> >
> > That means the shrinker can always just take the radix tree
> > nodes off the end.
>
> Right, but it can't necessarily free the node as it may still have
> pointers to pages in it. In that case, it would have to simply
> rotate the page to the end of the LRU again.
>
> Unless, of course, we kept track of the number of exceptional
> entries in a node and didn't add it to the reclaim list until there
> were no non-expceptional entries in the node....

I think that's doable. As long as there is an actual page in the
node, no memory is going to be freed anyway. And we have a full int
per node with only 64 slots, we can easily split the counter.

> > >> But it
> > >> looks like tracking radix tree nodes with a list and backpointers to
> > >> the mapping object for the lock etc. will be a major pain in the ass.
> > >
> > > Perhaps so - it may not work out when we get down to the fine
> > > details...
> >
> > I suspect that a combination of lifetime rules (inode cannot
> > disappear until all the radix tree nodes) and using RCU free
> > for the radix tree nodes, and the inodes might do the trick.
> >
> > That would mean that, while holding the rcu read lock, the
> > back pointer from a radix tree node to the inode will always
> > point to valid memory.
>
> Yes, that is what I was thinking...
>
> > That allows the shrinker to lock the inode, and verify that
> > the inode is still valid, before it attempts to rcu free the
> > radix tree node with shadow entries.
>
> Lock the mapping, not the inode. The radix tree is protected by the
> mapping_lock, not an inode lock. i.e. I'd hope that this can all b
> contained within the struct address_space and not require any
> knowledge of inodes or inode lifecycles at all.

Agreed, we can point to struct address_space and invalidate it by
setting mapping->host to NULL or so during the RCU grace period.

Also, the parent pointer is in a union with the two-word rcu_head, so
we get the address space backpointer for free.

The struct list_head for the FIFO, as you said, we can get for free as
well (at least on slab).

The FIFO lists themselves can be trimmed by a shrinker, I think. You
were concerned about wind-up but if the nodes are not in excess and
->count just returns 0 until we are actually prepared to shrink
objects, then there shouldn't be any windup, right?

I don't see a natural threshold for "excess" yet, though, because
there is no relationship between where radix_tree_node is allocated
and which zones the contained shadow entries point to, so it's hard to
estimate how worthwile any given radix_tree_node is to a node. Maybe
tying it to an event might be better, like reclaim pressure, swapping
etc.

> > It also means that locking only needs to be in the inode,
> > and on the LRU list for shadow radix tree nodes.
> >
> > Does that sound sane?
> >
> > Am I overlooking something?
>
> It's pretty much along the same lines of what I was thinking, but
> lets see what Johannes thinks.

That sounds great to me. I just have looked at this code for so long
that I'm getting blind towards it, so I really appreciate your input.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/