Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists afterdirect reclaim allocation fails

From: Dave Chinner
Date: Fri Sep 03 2010 - 22:27:29 EST

Next message: Vipin Mehta: "RE: linux-next: Tree for September 3 (ar6k)"
Previous message: Greg KH: "Re: linux-next: Tree for September 3 (ar6k)"
In reply to: Andrew Morton: "Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists afterdirect reclaim allocation fails"
Next in thread: Andrew Morton: "Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists afterdirect reclaim allocation fails"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> On Fri, 3 Sep 2010 10:08:46 +0100
> Mel Gorman <mel@xxxxxxxxx> wrote:
>
> > When under significant memory pressure, a process enters direct reclaim
> > and immediately afterwards tries to allocate a page. If it fails and no
> > further progress is made, it's possible the system will go OOM. However,
> > on systems with large amounts of memory, it's possible that a significant
> > number of pages are on per-cpu lists and inaccessible to the calling
> > process. This leads to a process entering direct reclaim more often than
> > it should increasing the pressure on the system and compounding the problem.
> >
> > This patch notes that if direct reclaim is making progress but
> > allocations are still failing that the system is already under heavy
> > pressure. In this case, it drains the per-cpu lists and tries the
> > allocation a second time before continuing.
....
> The patch looks reasonable.
>
> But please take a look at the recent thread "mm: minute-long livelocks
> in memory reclaim". There, people are pointing fingers at that
> drain_all_pages() call, suspecting that it's causing huge IPI storms.
>
> Dave was going to test this theory but afaik hasn't yet done so. It
> would be nice to tie these threads together if poss?

It's been my "next-thing-to-do" since David suggested I try it -
tracking down other problems has got in the way, though. I
just ran my test a couple of times through:

$ ./fs_mark -D 10000 -L 63 -S0 -n 100000 -s 0 \
-d /mnt/scratch/0 -d /mnt/scratch/1 \
-d /mnt/scratch/3 -d /mnt/scratch/2 \
-d /mnt/scratch/4 -d /mnt/scratch/5 \
-d /mnt/scratch/6 -d /mnt/scratch/7

To create millions of inodes in parallel on an 8p/4G RAM VM.
The filesystem is ~1.1TB XFS:

# mkfs.xfs -f -d agcount=16 /dev/vdb
meta-data=/dev/vdb isize=256 agcount=16, agsize=16777216 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=268435456, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=131072, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch

Performance prior to this patch was that each iteration resulted in
~65k files/s, with occassionaly peaks to 90k files/s, but drops to
frequently 45k files/s when reclaim ran to reclaim the inode
caches. This load ran permanently at 800% CPU usage.

Every so often (may once or twice a 50M inode create run) all 8 CPUs
would remain pegged but the create rate would drop to zero for a few
seconds to a couple of minutes. that was the livelock issues I
reported.

With this patchset, I'm seeing a per-iteration average of ~77k
files/s, with only a couple of iterations dropping down to ~55k
file/s and a significantly number above 90k/s. The runtime to 50M
inodes is down by ~30% and the average CPU usage across the run is
around 700%. IOWs, there a significant gain in performance there is
a significant drop in CPU usage. I've done two runs to 50m inodes,
and not seen any sign of a livelock, even for short periods of time.

Ah, spoke too soon - I let the second run keep going, and at ~68M
inodes it's just pegged all the CPUs and is pretty much completely
wedged. Serial console is not responding, I can't get a new login,
and the only thing responding that tells me the machine is alive is
the remote PCP monitoring. It's been stuck for 5 minutes .... and
now it is back. Here's what I saw:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-wedge-1.png

The livelock is at the right of the charts, where the top chart is
all red (system CPU time), and the other charts flat line to zero.

And according to fsmark:

1 66400000 0 64554.2 7705926
1 67200000 0 64836.1 7573013
<hang happened here>
2 68000000 0 69472.8 7941399
2 68800000 0 85017.5 7585203

it didn't record any change in performance, which means the livelock
probably occurred between iterations. I couldn't get any info on
what caused the livelock this time so I can only assume it has the
same cause....

Still, given the improvements in performance from this patchset,
I'd say inclusion is a no-braniner....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Vipin Mehta: "RE: linux-next: Tree for September 3 (ar6k)"
Previous message: Greg KH: "Re: linux-next: Tree for September 3 (ar6k)"
In reply to: Andrew Morton: "Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists afterdirect reclaim allocation fails"
Next in thread: Andrew Morton: "Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists afterdirect reclaim allocation fails"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]