Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Brian Foster
Date: Mon Feb 06 2017 - 10:47:40 EST


On Mon, Feb 06, 2017 at 03:42:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 09:35:33, Brian Foster wrote:
> > On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > > Brian Foster wrote:
> > > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > > [Let's CC more xfs people]
> > > > >
> > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > (1) I got an assertion failure.
> > > > >
> > > > > I suspect this is a result of
> > > > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@xxxxxxxxxx
> > > > > I have no idea what the assert means though.
> > > > >
> > > > > >
> > > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > > >
> > > > Indirect block reservation underrun on delayed allocation extent merge.
> > > > > These are extra blocks used for the inode bmap btree when a delalloc
> > > > extent is converted to physical blocks. We're in a case where we expect
> > > > to only ever free excess blocks due to a merge of extents with
> > > > independent reservations, but a situation occurs where we actually need
> > > > blocks and hence the assert fails. This can occur if an extent is merged
> > > > with one that has a reservation less than the expected worst case
> > > > > reservation for its size (because of previous extent splits due to hole
> > > > punches, for example). Therefore, I think the core expectation that
> > > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > > pre-reserved is invalid.
> > > >
> > > > Can you describe the workload that reproduces this? FWIW, I think the
> > > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > > and have a couple patches to fix up indlen reservation that I haven't
> > > > posted yet. The diff that deals with this particular bit is appended.
> > > > Care to give that a try?
> > >
> > > The workload is to write to a single file on XFS from 10 processes demonstrated at
> > > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@xxxxxxxxxxxxxxxxxxx
> > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> > >
> >
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
> >
> > Normally I'd expect to see this kind of thing on a hole punching
> > workload or dealing with large, sparse files that make use of
> > speculative preallocation (post-eof blocks allocated in anticipation of
> > file extending writes). I'm wondering if what is happening here is that
> > the appending writes and file closes due to oom kills are generating
> > speculative preallocs and prealloc truncates, respectively, and that
> > causes prealloc extents at the eof boundary to be split up and then
> > re-merged by surviving appending writers.
>
> Can those preallocs be affected by
> http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@xxxxxxxxxx ?
>

Hmm, I wouldn't expect that to make much of a difference wrt the core
problem. The prealloc is created on a file extending write that requires
block allocation (we basically just tack on extra blocks to an extending
alloc based on some heuristics like the size of the file and the
previous extent). Whether that allocation occurs on one iomap iteration
or another due to a short write and retry, I wouldn't expect to matter
that much.

I suppose it could change the behavior of a specialized workload though.
E.g., if it caused a write() call to return quicker and thus lead to a
file close(). We do use file release as an indication that prealloc will
not be used and can reclaim it at that point (presumably causing an
extent split with pre-eof blocks).
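
To make the indlen underrun discussed earlier in the thread concrete, here
is a rough userspace sketch of a merge between two delalloc extents. It is
a toy model, not the XFS code: the struct, the toy_* names and the
worst-case formula are all made up for illustration, and it only assumes
that the XFS_FILBLKS_MIN() change amounts to clamping the merged
reservation to what the two extents already hold.

```c
#include <assert.h>

/* Toy delalloc extent: data length plus the indirect blocks reserved for it. */
struct delalloc_ext {
	unsigned int len;	/* data blocks covered by the extent */
	unsigned int indlen;	/* indirect blocks currently reserved */
};

/*
 * Hypothetical worst-case estimate, NOT the real XFS calculation:
 * one indirect block per four data blocks (rounded up), plus one.
 */
static unsigned int toy_worst_indlen(unsigned int len)
{
	return (len + 3) / 4 + 1;
}

/* Nonzero if merging a and b would need more blocks than they hold. */
static int toy_merge_underruns(struct delalloc_ext a, struct delalloc_ext b)
{
	return toy_worst_indlen(a.len + b.len) > a.indlen + b.indlen;
}

/*
 * Merge two adjacent extents. The new reservation is clamped to what the
 * two extents already hold, so the merge can only free blocks, never
 * allocate them.
 */
static struct delalloc_ext toy_merge(struct delalloc_ext a, struct delalloc_ext b)
{
	unsigned int oldlen = a.indlen + b.indlen;
	unsigned int newlen = toy_worst_indlen(a.len + b.len);
	struct delalloc_ext merged;

	merged.len = a.len + b.len;
	merged.indlen = newlen < oldlen ? newlen : oldlen;
	assert(merged.indlen <= oldlen);
	return merged;
}
```

In this model, two 8-block extents that each carry a full reservation
(3 blocks under the toy formula) merge into a 16-block extent that needs 5,
so one block is freed. If earlier splits left each of them with only 1
block, the merged extent would want 5 but only 2 are available. That is the
case where an "oldlen > newlen" style check trips, and the clamp keeps the
reservation at 2 instead.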

Brian

> --
> Michal Hocko
> SUSE Labs