Re: Deadlock possibly caused by too_many_isolated.

From: Torsten Kaiser
Date: Wed Oct 20 2010 - 11:35:34 EST


On Wed, Oct 20, 2010 at 4:23 PM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
> Hello
>
> On Wed, Oct 20, 2010 at 09:25:49AM +0200, Torsten Kaiser wrote:
>> On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>> >> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>> >> bio. That bio is the submitted, but the submit path seems to get into
>> >> make_request from raid1.c and that allocates a second bio from
>> >> bio_alloc() via bio_clone().
>> >>
>> >> I am seeing this pattern (swap_writepage calling
>> >> md_make_request/make_request and then getting stuck in mempool_alloc)
>> >> more than 5 times in the SysRq+T output...
>> >
>> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
>> > inside mempool_alloc(), which can be fixed by this patch.
>>
>> No. I tested the patch (ontop of Neils fix and your patch regarding
>> too_many_isolated()), but the system got stuck the same way on the
>> first try to fill the tmpfs.
>> I think the basic problem is, that the mempool that should guarantee
>> progress is exhausted because the raid1 device is stacked between the
>> pageout code and the disks and so the "use only 1 bio"-rule gets
>> violated.
>>
>> > Thanks,
>> > Fengguang
>> > ---
>> >
>> > concurrent direct page reclaim problem
>> >
>> > ?__GFP_NORETRY page allocations may fail when there are many concurrent page
>> > ?allocating tasks, but not necessary in real short of memory. The root cause
>> > ?is, tasks will first run direct page reclaim to free some pages from the LRU
>> > ?lists and put them to the per-cpu page lists and the buddy system, and then
>> > ?try to get a free page from there. ?However the free pages reclaimed by this
>> > ?task may be consumed by other tasks when the direct reclaim task is able to
>> > ?get the free page for itself.
>>
>> I believe the facts disagree with that assumtion. My bad for not
>> posting this before, but I also used SysRq+M to see whats going on,
>> but each time there still was some free memory.
>> Here is the SysRq+M output from the run with only Neils patch applied,
>> but on each other run the same ~14Mb stayed free
>
>
> What is your problem?(Sorry if you explained it several time).

The original problem was that using too many gcc's caused a swapstorm
that completely hung my system.
I first blame it one the workqueue changes in 2.6.36-rc1 and/or its
interaction with XFS (because in -rc5 a workqueue related problem in
XFS got fixed), but Tejun Heo found out that a) it were exhausted
mempools and not the workqueues and that the problems itself existed
at least in 2.6.35 already. In
http://marc.info/?l=linux-raid&m=128699402805191&w=2 I have describe a
simpler testcase that I found after looking more closely into the
mempools.

Short story: swaping over RAID1 (drivers/md/raid1.c) can cause a
system hang, because it is using too much of the fs_bio_set mempool
from fs/bio.c.

> I read the thread.
> It seems Wu's patch solved deadlock problem by FS lock holding and too_many_isolated.
> What is the problem remained in your case? unusable system by swapstorm?
> If it is, I think it's expected behavior. Please see the below comment.
> (If I don't catch your point, Please explain your problem.)

I do not have a problem, if the system becomes unusable *during* a
swapstorm, but it should recover. That is not the case in my system.
With Wu's too_many_isolated-patch and Neil's patch agains raid1.c the
system does no longer seem to be completely stuck (a swapoutrate of
~80kb every 20 seconds still happens), but I would still expect a
better recovery time. (At that rate the recovery would probably take a
few days...)

>> [  437.481365] SysRq : Show Memory
>> [  437.490003] Mem-Info:
>> [  437.491357] Node 0 DMA per-cpu:
>> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
>> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
>> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
>> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
>> [  437.500032] Node 0 DMA32 per-cpu:
>> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>> [  437.500032] Node 1 DMA32 per-cpu:
>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>> [  437.500032] Node 1 Normal per-cpu:
>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
>> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
>> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
>> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
>> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
>> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
>> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
>> active_anon:0kB inact
>> ive_anon:7596kB active_file:12kB inactive_file:0kB unevictable:0kB
>> isolated(anon):0kB i
>> solated(file):0kB present:15768kB mlocked:0kB dirty:0kB
>> writeback:404kB mapped:0kB shme
>> m:7192kB slab_reclaimable:32kB slab_unreclaimable:304kB
>> kernel_stack:0kB pagetables:0kB
>>  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:118
>> all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004
>
> Node 0 DMA : free 8008K but lowmem_reserve 8012K(2004 pages)
> So page allocator can't allocate the page unless preferred zone is DMA
>
>> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
>> high:6052kB active_anon:2
>> 844kB inactive_anon:1918424kB active_file:3428kB inactive_file:3780kB
>> unevictable:0kB isolated(anon):1232kB isolated(file):0kB
>> present:2052320kB mlocked:0kB dirty:0kB writeback:72016kB
>> mapped:2232kB shmem:1847640kB slab_reclaimable:5444kB
>> slab_unreclaimable:13508kB kernel_stack:744kB pagetables:864kB
>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>> all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>
> Node 0 DMA32 : free 2980K but min 4036K.
> Few file LRU compare to anon LRU

In the testcase I fill a tmpfs as fast as I can with data from
/dev/zero. So nearly everything gets swapped out and only the last
written data from the tmpfs fills all RAM. (I have 4GB RAM, the tmpfs
is limited to 6GB, 16 dd's are writing into it)

> Normally, it could fail to allocate the page.
> 'Normal' means caller doesn't request alloc_pages with __GFP_HIGH or !__GFP_WAIT
> Generally many call sites don't pass gfp_flag with __GFP_HIGH|!__GFP_WAIT.
>
>> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
>> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
>> inactive_file:28kB unevictable:0kB isolated(anon):768kB
>> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
>> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
>> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 0 505 505
>
> Node 1 DMA32 free : 2188K min 3036K
> It's a same situation with Node 0 DMA32.
> Normally, it could fail to allocate the page.
> Few file LRU compare to anon LRU
>
>
>> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
>> high:1524kB active_anon:5312kB inactive_anon:459544kB
>> active_file:3228kB inactive_file:3084kB unevictable:0kB
>> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
>> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
>> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
>> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
>> pages_scanned:9678 all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>
> Node 1 Normal : free 708K min 1016K
> Normally, it could fail to allocate the page.
> Few file LRU compare to anon LRU
>
>> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
>> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
>> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
>> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
>> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
>> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
>> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
>> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
>> [  437.500032] 989289 total pagecache pages
>> [  437.500032] 25398 pages in swap cache
>> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
>> [  437.500032] Free swap  = 9865628kB
>> [  437.500032] Total swap = 10000316kB
>> [  437.500032] 1048575 pages RAM
>> [  437.500032] 33809 pages reserved
>> [  437.500032] 7996 pages shared
>> [  437.500032] 1008521 pages non-shared
>>
> All zones don't have enough pages and don't have enough file lru pages.
> So swapout is expected behavior, I think.
> It means your workload exceeds your system available DRAM size.

Yes, as intended. I wanted to create many writes to a RAID1 device
under memory pressure to show/verify that the current use of mempools
in raid1.c is buggered.

That is not really any sane workload, that literally is just there to
create a swapstorm and then see if the system survives it.

The problem is, that the system is not surviving it: bio allocations
fail in raid1.c and it falls back to the fs_bio_set mempool. But that
mempool is only 2 entries big, because you should ever only use one of
its entries at a time. But the current mainline code from raid1.c
allocates one bio per drive before submitting it -> That bug is fixed
my Neil's patch and I would have expected that to fix my hang. But it
seems that there is an additional problem so that mempool still get
emptied. And that means that no writeback happens any longer and so
the kernel can't swapout and gets stuck.

I think the last mail from Jens Axboe is the correct answer, not
increasing the fs_bio_set mempool size via BIO_POOL_SIZE.

But should that go even further: Forbid any use of bio_alloc() and
bio_clone() in any device drivers? Or at the very least in all device
drivers that could be used for swapspace?

Torsten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/