Re: Regression from 2.6.36

From: azurIt
Date: Fri Apr 15 2011 - 05:59:11 EST

Next message: Mark Brown: "Re: [PATCH] regulator: Group TI TPSxxxxx regulators together"
Previous message: Artem Bityutskiy: "Re: [PATCH 2/2] mtd: msm_nand: Add initial msm nand driver support."
In reply to: Mel Gorman: "Re: Regression from 2.6.36"
Next in thread: Mel Gorman: "Re: Regression from 2.6.36"

This new patch also works fine and fixes the problem.

Mel, I cannot run your script:
# perl watch-highorder-latency.pl
Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.

# ls -ld /sys/kernel/debug/
ls: cannot access /sys/kernel/debug/: No such file or directory
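
A missing /sys/kernel/debug directory usually means that debugfs is either not built into the running kernel (CONFIG_DEBUG_FS) or not mounted. Below is a minimal sketch for telling the two cases apart, assuming the usual /proc/filesystems and /proc/mounts formats. The function name debugfs_status is made up for illustration.

```python
def debugfs_status(filesystems_text, mounts_text):
    """Classify debugfs availability from /proc/filesystems and /proc/mounts text."""
    # /proc/filesystems lines look like "nodev<TAB>debugfs" or "<TAB>ext3";
    # the filesystem name is always the last whitespace-separated field.
    supported = any(
        line.split()[-1] == "debugfs"
        for line in filesystems_text.splitlines()
        if line.strip()
    )
    if not supported:
        return "unsupported"
    # /proc/mounts fields: device, mount point, fs type, options, dump, pass.
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            return "mounted at " + fields[1]
    return "not mounted"
```

Feed it the contents of the two files. If the result is "not mounted", running `mount -t debugfs none /sys/kernel/debug` is the usual remedy. If it is "unsupported", the kernel most likely needs CONFIG_DEBUG_FS (and the ftrace options the script relies on) enabled.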

azur

______________________________________________________________
> From: "Mel Gorman" <mel@xxxxxxxxx>
> To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Date: 14.04.2011 12:25
> Subject: Re: Regression from 2.6.36
>
> CC: "Eric Dumazet" <eric.dumazet@xxxxxxxxx>, "Changli Gao" <xiaosuo@xxxxxxxxx>, "Américo Wang" <xiyou.wangcong@xxxxxxxxx>, "Jiri Slaby" <jslaby@xxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, "Jiri Slaby" <jirislaby@xxxxxxxxx>
>On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
>> On Wed, 13 Apr 2011 04:37:36 +0200
>> Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>>
>> > On Tuesday 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
>> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@xxxxxxxxx> wrote:
>> > >
>> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
>> > > > <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> > > > >
>> > > > > It's somewhat unclear (to me) what caused this regression.
>> > > > >
>> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
>> > > > > and this makes the page allocator go nuts trying to satisfy high-order
>> > > > > page allocation requests?
>> > > > >
>> > > > > Is it because the kernel now will usually free the fdtable
>> > > > > synchronously within the rcu callback, rather than deferring this to a
>> > > > > workqueue?
>> > > > >
>> > > > > The latter seems unlikely, so I'm thinking this was a case of
>> > > > > high-order-allocations-considered-harmful?
>> > > > >
>> > > >
>> > > > Maybe, but I am not sure. Maybe my patch causes too much internal
>> > > > fragmentation. For example, a request for 5 pages gets 8 pages, so 3
>> > > > pages are wasted, and memory thrashing eventually follows.
>> > >
>> > > That theory sounds less likely, but could be tested by using
>> > > alloc_pages_exact().
>> > >
>> >
>> > Very unlikely, since fdtable sizes are powers of two, unless you hit
>> > sysctl_nr_open and it was changed (default value being 2^20)
>> >
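
The two positions above come down to simple arithmetic: the page allocator hands out blocks of 2^order pages, so a 5-page request is served from an 8-page block, while a power-of-two request wastes nothing. A small sketch follows, with helper names made up for illustration.

```python
def order_for_pages(pages):
    """Smallest order such that 2**order pages cover the request."""
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def wasted_pages(pages):
    """Pages left unused when the request is rounded up to a 2**order block."""
    return (1 << order_for_pages(pages)) - pages


# Print request size, order used, and pages wasted for a few requests.
for request in (4, 5, 8):
    print(request, order_for_pages(request), wasted_pages(request))
```

Only the 5-page request shows any waste here, which is the case alloc_pages_exact() addresses by freeing the unused tail pages.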
>>
>> So am I correct in believing that this regression is due to the
>> high-order allocations putting excess stress onto page reclaim?
>>
>
>This is very plausible but it would be nice to get confirmation on
>what the size of the fdtable was to be sure. If it's big enough for
>high-order allocations and it's a fork-heavy workload with memory
>mostly in use, the fork() latencies could be getting very high. In
>addition, each fork is potentially kicking kswapd awake (to rebalance
>the zone for higher orders). I do not see CONFIG_COMPACTION enabled
>meaning that if I'm right in that kswapd is awake and fork() is
>entering direct reclaim, then we are lumpy reclaiming as well which
>can stall pretty severely.
>
>> If so, then how large _are_ these allocations? This perhaps can be
>> determined from /proc/slabinfo. They must be pretty huge, because slub
>> likes to do excessively-large allocations and the system handles that
>> reasonably well.
>>
>
>I'd be interested in finding out the value of /proc/sys/fs/file-max and
>the output of ulimit -n (max open files) on the main server. This
>should help us determine what the size of the fdtable is.
>
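
As a rough guide to why those two values matter: the fd array holds one pointer per slot, so its size scales directly with the number of descriptors a process may open. The sketch below assumes 8-byte pointers and 4 KiB pages (typical x86-64 values) and assumes the slot count is already a power of two, as Eric notes. It ignores the much smaller fd bitmaps, and all names are illustrative.

```python
POINTER_SIZE = 8   # assumption: 64-bit kernel
PAGE_SIZE = 4096   # assumption: 4 KiB pages


def fd_array_bytes(nr_fds):
    """Bytes needed for an array of nr_fds 'struct file *' slots."""
    return nr_fds * POINTER_SIZE


def allocation_order(size):
    """Page-allocator order needed for a physically contiguous request."""
    pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


for nr_fds in (1024, 4096, 65536, 1 << 20):
    size = fd_array_bytes(nr_fds)
    print(f"{nr_fds:>8} fds -> {size // 1024:>5} KiB, order {allocation_order(size)}")
```

Under these assumptions a 1024-slot table fits in an order-1 block, while a 65536-slot table needs an order-7 block.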
>> I suppose that a suitable fix would be
>>
>>
>> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>
>> Azurit reports large increases in system time after 2.6.36 when running
>> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
>> to allocate fdmem if possible").
>>
>> That patch caused the vfs to use kmalloc() for very large allocations and
>> this is causing excessive work (and presumably excessive reclaim) within
>> the page allocator.
>>
>> Fix it by falling back to vmalloc() earlier - when the allocation attempt
>> would have been considered "costly" by reclaim.
>>
>> Reported-by: azurIt <azurit@xxxxxxxx>
>> Cc: Changli Gao <xiaosuo@xxxxxxxxx>
>> Cc: Americo Wang <xiyou.wangcong@xxxxxxxxx>
>> Cc: Jiri Slaby <jslaby@xxxxxxx>
>> Cc: Eric Dumazet <eric.dumazet@xxxxxxxxx>
>> Cc: Mel Gorman <mel@xxxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> ---
>>
>> fs/file.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff -puN fs/file.c~a fs/file.c
>> --- a/fs/file.c~a
>> +++ a/fs/file.c
>> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>> */
>> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>>
>> -static inline void *alloc_fdmem(unsigned int size)
>> +static void *alloc_fdmem(unsigned int size)
>> {
>> - void *data;
>> -
>> - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> - if (data != NULL)
>> - return data;
>> -
>> + /*
>> + * Very large allocations can stress page reclaim, so fall back to
>> + * vmalloc() if the allocation size will be considered "large" by the VM.
>> + */
>> + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>
>The reporter will need to retest to confirm this is really ok. The patch
>that was reported to help avoided high-order allocations entirely. If
>fork-heavy workloads are really entering direct reclaim and increasing
>fork latency enough to ruin performance, then this patch will also suffer.
>How much it helps depends on how big the fdtable is.
>
>> + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> + if (data != NULL)
>> + return data;
>> + }
>> return vmalloc(size);
>> }
>>
>
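
The effect of the proposed size check can be modelled in a few lines. PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel; the 4 KiB page size is an assumption, and choose_allocator is a made-up name that only reports which path the patched alloc_fdmem() would try first.

```python
PAGE_SIZE = 4096               # assumption: 4 KiB pages
PAGE_ALLOC_COSTLY_ORDER = 3    # kernel constant (include/linux/mmzone.h)


def choose_allocator(size):
    """Return the allocator the patched alloc_fdmem() would try first."""
    if size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER):
        # kmalloc() is attempted; the patch still falls back to vmalloc()
        # if that attempt fails.
        return "kmalloc"
    return "vmalloc"


for size in (8 * 1024, 32 * 1024, 32 * 1024 + 1, 512 * 1024):
    print(size, choose_allocator(size))
```

With 4 KiB pages the cut-off is 32 KiB, so requests up to order 3 still go through kmalloc(). That is the case Mel flags above as needing a retest.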
>I'm attaching a primitive perl script that reports high-order allocation
>latencies. It'd be interesting to see what its output looks like,
>particularly when the server is in trouble, if the bug reporter has the
>time.
>
>--
>Mel Gorman
>SUSE Labs
>
>