Re: [rfc] superblock shrinker accumulating excessive deferred counts

From: Dave Chinner
Date: Mon Jul 17 2017 - 17:51:09 EST


On Mon, Jul 17, 2017 at 01:37:35PM -0700, David Rientjes wrote:
> On Mon, 17 Jul 2017, Dave Chinner wrote:
>
> > > This is a side effect of super_cache_count() returning the appropriate
> > > count but super_cache_scan() refusing to do anything about it and
> > > immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.
> >
> > Yup. Happens during things like memory allocations in filesystem
> > transaction context. e.g. when your memory pressure is generated by
> > GFP_NOFS allocations within transactions whilst doing directory
> > traversals (say 'chown -R' across an entire filesystem), then we
> > can't do direct reclaim on the caches that are generating the memory
> > pressure and so have to defer all the work to either kswapd or the
> > next GFP_KERNEL allocation context that triggers reclaim.
> >
>
> Thanks for looking into this, Dave!
>
> The number of GFP_NOFS allocations that build up the deferred counts can
> be unbounded, however, so this can become excessive, and the oom killer
> will not kill any processes in this context. Although the motivation to
> do additional reclaim because of past GFP_NOFS reclaim attempts is
> worthwhile, I think it should be limited because currently it only
> increases until something is able to start draining these excess counts.

Usually kswapd has kicked in by this point and is doing work. Why
isn't kswapd doing the shrinker work in the background?
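
For anyone else following the thread, the bail-out being described
above is the deadlock avoidance check at the top of
super_cache_scan(). Roughly, paraphrasing fs/super.c - not exact,
check your own tree:

        static unsigned long super_cache_scan(struct shrinker *shrink,
                                              struct shrink_control *sc)
        {
                struct super_block *sb;

                sb = container_of(shrink, struct super_block, s_shrink);

                /*
                 * Deadlock avoidance: we may hold filesystem locks
                 * here (e.g. transaction context), so we must not
                 * recurse into the filesystem. The work is counted
                 * but not done, so it lands in the shrinker's
                 * nr_deferred for kswapd or the next GFP_KERNEL
                 * reclaimer to pick up.
                 */
                if (!(sc->gfp_mask & __GFP_FS))
                        return SHRINK_STOP;
                ...
        }

Anything generating reclaim pressure from GFP_NOFS context trips
that check every time, so all of its shrinker work ends up deferred.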

> Having 10,000 GFP_NOFS reclaim attempts store up
> (2 * nr_scanned * freeable) / (nr_eligible + 1) objects 10,000 times
> such that it exceeds freeable by many magnitudes doesn't seem like a
> particularly useful thing. For reference, we have seen nr_deferred for a
> single node to be > 10,000,000,000 in practice.

What is the workload, and where is that much GFP_NOFS allocation
coming from?

> total_scan is limited to
> 2 * freeable for each call to do_shrink_slab(), but such an excessive
> deferred count will guarantee it retries 2 * freeable each time instead of
> the proportion of lru scanned as intended.
>
> What breaks if we limit the nr_deferred counts to freeable * 4, for
> example?

No solutions are viable until the cause of the windup is known and
understood....

> > Can you post a shrinker trace that shows the deferred count wind
> > up and then display the problem you're trying to describe?
> >
>
> All threads contending on the list_lru's nlru->lock because they are all
> stuck in super_cache_count() while one thread is iterating through an
> excessive number of deferred objects in super_cache_scan(), contending for
> the same locks and nr_deferred never substantially goes down.

Ugh. The per-node lru list count was designed to run unlocked and so
avoid this sort of (known) scalability problem.

Ah, see the difference between list_lru_count_node() and
list_lru_count_one(). list_lru_count_one() should only take locks
for the memcg lookup when it is trying to shrink a memcg. That
needs to be fixed before anything else and, if possible, the memcg
lookup should be made lockless....
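
To make the distinction concrete, the two count paths look roughly
like this in a current tree (paraphrased from mm/list_lru.c, not
exact):

        /* per-node count: lockless, just reads the node counter */
        unsigned long list_lru_count_node(struct list_lru *lru, int nid)
        {
                struct list_lru_node *nlru = &lru->node[nid];

                return nlru->nr_items;
        }

        /*
         * memcg-aware count: takes the per-node lock to look up the
         * memcg's private list
         */
        unsigned long list_lru_count_one(struct list_lru *lru,
                                         int nid, struct mem_cgroup *memcg)
        {
                struct list_lru_node *nlru = &lru->node[nid];
                struct list_lru_one *l;
                unsigned long count;

                spin_lock(&nlru->lock);
                l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
                count = l->nr_items;
                spin_unlock(&nlru->lock);

                return count;
        }

Every memcg-aware super_cache_count() call goes through the locked
path, which is where the contention you're describing comes from.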

IIRC, the memcg shrinkers all set sc->nid = 0, as the memcg LRUs are
not per-node lists - each is just a single linked list - so there
are other scalability problems with memcgs, too.

> The problem with the superblock shrinker, which is why I emailed Al
> originally, is also that it is SHRINKER_MEMCG_AWARE. Our
> list_lru_shrink_count() is only representative for the list_lru of
> sc->memcg, which is used in both super_cache_count() and
> super_cache_scan() for various math. The nr_deferred counts from the
> do_shrink_slab() logic, however, are per-nid and, as such, various memcgs
> get penalized with excessive counts that they do not have freeable to
> begin with.

Yup, the memcg shrinking was shoe-horned into the per-node LRU
infrastructure, and the high level accounting is completely unaware
of the fact that memcgs have their own private LRUs. We left the
windup in place because slab caches are shared, and it's possible
that memory can't be freed because pages have objects from different
memcgs pinning them. Hence we need to bleed at least some of that
"we can't make progress" count back into the global "deferred
reclaim" pool to get other contexts to do some reclaim.

Perhaps that's the source of the problem - memcgs have nasty
behaviours when they have very few reclaimable objects (look at
all the "we need to be able to reclaim every single object" fixes),
so I would not be surprised if it's a single memcg under extreme
memory pressure that is causing the windup. Still, I think the lock
contention problems should be sorted first - removing the shrinker
serialisation will change behaviour significantly in these
situations.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx