Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

From: Mel Gorman
Date: Mon Jan 30 2017 - 04:13:56 EST


On Sun, Jan 29, 2017 at 04:50:03PM -0600, Trevor Cordes wrote:
> On 2017-01-25 Michal Hocko wrote:
> > On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > > OK, I patched & compiled mhocko's git tree from the other day
> > > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > > couple of weeks ago shows the newest commit (git log) is
> > > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me know
> > > if I'm doing something wrong, see below.)
> >
> > My fault. I should have noted that you should use since-4.9 branch.
>
> OK, I have good news. I compiled your mhocko git tree (properly this
> time!) using since-4.9 branch (last commit
> ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
> 3am's, over 60 hours, and I made sure all the usual oom culprits ran,
> and I ran extras (finds on the whole tree, extra rdiff-backups) to try
> to tax it. Based on my previous criteria I would say your since-4.9 as
> of the above commit solves my bug, at least over a 3 day test span
> (which it never survives when the bug is present)!
>

That's good news. It means the more extreme options may not be
necessary.

> I tested WITHOUT any cgroup/mem boot options. I do still have my
> mem=6G limiter on, though (I've never tested with it off, until I solve
> the bug with it on, since I've had it on for many months for other
> reasons).
>

It may be an option to try relaxing that and see at what point it fails.
You may find at some point that memory is not utilised as there is not
enough lowmem for metadata to track data in highmem. That's not unexpected.
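
As a rough illustration of why lowmem becomes the limit, the sketch below
estimates how much lowmem the per-page metadata (struct page) alone would
take for a given amount of RAM. The constants are assumptions rather than
measurements from the affected box: 4 KiB pages, about 32 bytes per struct
page on 32-bit, and about 896 MiB of lowmem with the default 3G/1G split.

```python
# Rough estimate of lowmem consumed by per-page metadata (struct page).
# All constants below are assumptions for illustration, not measured values.
PAGE_SIZE = 4096            # bytes per page on x86
STRUCT_PAGE_SIZE = 32       # assumed bytes per struct page on a 32-bit kernel
LOWMEM_BYTES = 896 << 20    # assumed lowmem size with the default 3G/1G split


def mem_map_bytes(total_ram_bytes):
    """Bytes of lowmem needed to describe every page of RAM."""
    return (total_ram_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE


def lowmem_fraction(total_ram_bytes):
    """Share of the assumed lowmem taken by the struct page array."""
    return mem_map_bytes(total_ram_bytes) / LOWMEM_BYTES


for gib in (6, 16, 32, 64):
    ram = gib << 30
    print(f"{gib:>2} GiB RAM: mem_map ~{mem_map_bytes(ram) >> 20} MiB, "
          f"{lowmem_fraction(ram):.1%} of lowmem")
```

This counts struct page only. Slab caches such as inodes and dentries also
have to live in lowmem, so the real pressure from a large tree scan is
higher than this estimate suggests.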

> What do I test next? Does the since-4.9 stuff get pushed into vanilla
> (4.9 hopefully?) so it can find its way into Fedora's stuck F24
> kernel?
>

Michal has already made suggestions here and I've nothing to add.

> I want to also note that the RHBZ
> https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
> interest as more people start me-too'ing. The situation is almost
> always the same: large rsync's or similar tree-scan accesses cause oom
> on PAE boxes. However, I wanted to note that many people there reported
> that cgroup_disable=memory doesn't fix anything for them, whereas that
> always makes the problem go away on my boxes. Strange.
>

It could simply be down to whether memcgs were actually in use or not.
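
One way to check that on a given box is to look at the memory line of
/proc/cgroups, which lists each controller with its hierarchy ID, number
of cgroups and enabled flag. The helper below is a minimal sketch:
memcg_in_use is a hypothetical name, and treating "more than one cgroup"
as "in use" is a heuristic assumption, not a kernel-defined test.

```python
def memcg_in_use(proc_cgroups_text):
    """Heuristic: True if the memory controller is enabled and has
    cgroups beyond the root one.

    Expects the contents of /proc/cgroups, whose columns are:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        # Skip the header line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _hierarchy, num_cgroups, enabled = fields[:4]
        if name == "memory":
            # Assumption: a count of 1 means only the root cgroup exists.
            return enabled == "1" and int(num_cgroups) > 1
    # No memory line at all: controller not built in or not listed.
    return False


# Typical use on a live system:
#   with open("/proc/cgroups") as f:
#       print(memcg_in_use(f.read()))
```

If this reports False on the boxes where cgroup_disable=memory makes no
difference, that would be consistent with memcgs not being in use there.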

> Thanks Michal and Mel, I really appreciate it!

I appreciate the detailed testing and reporting!

--
Mel Gorman
SUSE Labs