Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Mel Gorman
Date: Mon Jan 23 2017 - 05:49:22 EST


On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> On 2017-01-20 Mel Gorman wrote:
> > >
> > > Thanks for the OOM report. I was expecting it to be a particular
> > > shape and my expectations were not matched so it took time to
> > > consider it further. Can you try the cumulative patch below? It
> > > combines three patches that
> > >
> > > 1. Allow slab shrinking even if the LRU patches are unreclaimable in
> > > direct reclaim
> > > 2. Shrinks slab based once based on the contents of all memcgs
> > > instead of shrinking one at a time
> > > 3. Tries to shrink slabs if the lowmem usage is too high
> > >
> > > Unfortunately it's only boot tested on x86-64 as I didn't get the
> > > chance to setup an i386 test bed.
> > >
> >
> > There was one major flaw in that patch. This version fixes it and
> > addresses other minor issues. It may still be too agressive shrinking
> > slab but worth trying out. Thanks.
>
> I ran with your patch below and it oom'd on the first night. It was
> weird, it didn't hang the system, and my rebooter script started a
> reboot but the system never got more than half down before it just sat
> there in a weird state where a local console user could still login but
> not much was working. So the patches don't seem to solve the problem.
>
> For the above compile I applied your patches to 4.10.0-rc4+, I hope
> that's ok.
>

It would be strongly preferred to run them on top of Michal's other
fixes. The main reason it's preferred is because this OOM differs from
earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
That meant that the slab shrinking could not happen from direct reclaim so
the balancing from my patches would not occur. As Michal's other patches
affect how kswapd behaves, it's important.

Unfortunately, even that will be race prone for GFP_NOFS callers as
they'll effectively be racing to see if kswapd or another direct
reclaimer can reclaim before the OOM conditions are hit. It is by
design, but it's apparent that a __GFP_NOFAIL request can trigger OOM
relatively easily as it's not necessarily throttled or waiting on kswapd
to complete any work. I'll keep thinking about it.

--
Mel Gorman
SUSE Labs