Re: [BUG] fatal hang untarring 90GB file, possibly writeback related.

From: Colin Ian King
Date: Tue May 03 2011 - 14:56:00 EST


On Fri, 2011-04-29 at 00:40 +0200, Jan Kara wrote:
> On Thu 28-04-11 15:58:21, Colin Ian King wrote:
> > On Thu, 2011-04-28 at 16:33 +0200, Jan Kara wrote:
> > > On Thu 28-04-11 16:25:51, Jan Kara wrote:
> > > > On Thu 28-04-11 15:01:22, Colin Ian King wrote:
> > > > >
> > > > > > Could you post the soft lockups you're seeing?
> > > > >
> > > > > As requested, attached
> > > > Hum, what keeps puzzling me is that in all the cases of hangs I've seen
> > > > so far, we are stuck waiting for IO to finish for a long time - e.g. in the
> > > > traces below kjournald waits for PageWriteback bit to get cleared. Also we
> > > > are stuck waiting for page locks which might be because those pages are
> > > > being read in? All in all it seems that the IO is just incredibly slow.
> > > >
> > > > But it's not clear to me what pushes us into that situation (especially
> > > > since ext4 refuses to do any IO from ->writepage (i.e. kswapd) when the
> > > > > underlying blocks are not already allocated).
> > > > Hmm, maybe because the system is under memory pressure (and kswapd is not
> > > > able to get rid of dirty pages), we page out clean pages. That includes pages
> > > > of executables, which need to be paged in again soon anyway, putting a heavy
> > > > read load on the system, which makes writes crawl? I'm not sure why
> > > > compaction should make this any worse but maybe it can.
> > >
> > > James, Colin, can you capture output of 'vmstat 1' while you do the
> > > copying? Thanks.
> >
> > Attached.
> Thanks. So there are a few interesting points in the vmstat output:
> For the first 30 seconds, we are happily copying data - relatively steady read
> throughput (about 20-40 MB/s) and occasional peak from flusher thread
> flushing dirty data. During this time free memory drops from about 1.4 GB
> to about 22!!! MB - mm seems to like to really use the machine ;). Then
> things get interesting:
> procs -----------memory------------ ---swap--- -----io---- --system-- ----cpu----
>  r  b   swpd    free   buff   cache   si    so    bi    bo   in    cs us sy id wa
>  0  1      0   83372   5228 1450776    0     0 39684 90132  450   918  0  4 74 22
>  0  1      0   22416   5228 1509864    0     0 29452 48492  403   869  1  2 80 18
>  2  0      0   20056   5384 1513996    0     0  2248  2116  434  1191  4  4 71 21
>  0  1      0   19800   5932 1514600    0     0   644   104  454  1166  8  3 64 24
>  1  0      0   21848   5940 1515244    0     0   292   380  468  1775 15  6 75  3
>  1  0      0   20936   5940 1515876    0     0   296   296  496  1846 18  8 74  0
>  1  0      0   17792   5940 1516580    0     0   356   356  484  1862 23  8 69  0
>  1  0      0   17544   5940 1517364    0     0   412   412  482  1812 16  7 77  0
>  4  0      0   18148   5948 1517968    0     0   288   344  436  1749 19  9 69  3
>  0  2    220  137528   1616 1402468    0   220 74708  2164  849  1806  3  6 63 28
>  0  3    224   36184   1628 1499648    0     4 50820 86204  532  1272  0  4 64 32
>  0  2  19680   53688   1628 1484388   32 19456  6080 62972  242   287  0  2 63 34
>  0  2  36928 1407432   1356  150980    0 17252  1564 17276  368   764  1  3 73 22
>
> So when free memory reached about 20 MB, both read and write activity
> almost stalled for 7 s (probably everybody waits for free memory). Then
> mm manages to free 100 MB from page cache, things move on for two seconds,
> then we swap out! about 36 MB and page reclaim also finally decides it
> maybe has too much of page cache and reaps most of it (1.3 GB in one go).
> Then things get going again, although there are still occasional stalls
> such as this (about 30s later):
>  1  3  36688  753192   1208  792344    0     0 35948 32768  435  6625  0  6 61 33
>  0  2  36668  754996   1344  792760    0     0   252 58736 2832 16239  0  1 60 39
>  0  2  36668  750132   1388  796688    0     0  2508  1524  325   959  1  3 68 28
>  1  0  36668  751160   1400  797968    0     0   620   620  460  1470  6  6 50 38
>  1  0  36668  750516   1400  798520    0     0   300   300  412  1764 17  8 75  1
>  1  0  36668  750648   1408  799108    0     0   280   340  423  1816 18  6 73  3
>  1  0  36668  748856   1408  799752    0     0   336   328  409  1788 18  8 75  0
>  1  0  36668  748120   1416  800604    0     0   428   452  407  1723 14 10 75  2
>  1  0  36668  750048   1416  801176    0     0   296   296  405  1779 18  7 75  1
>  0  1  36668  650428   1420  897252    0     0 48100   556  658  1718 10  3 71 15
>  0  2  36668  505444   1424 1037012    0     0 69888 90272  686  1491  1  4 68 27
>  0  1  36668  479264   1428 1063372    0     0 12984 40896  324   674  1  1 76 23
> ...
> I'm not sure what we were blocked on here since there is still plenty of
> free memory (750 MB). These stalls repeat once in a while but things go on.
> Then at about 350s, things just stop:
> procs -----------memory------------ ---swap--- -----io---- --system-- ----cpu----
>  r  b   swpd    free   buff   cache   si    so    bi    bo   in    cs us sy id wa
>  3  1  75280   73564   1844 1503848    0     0 43396 81976  627  1061  0 25 42 32
>  3  3  75280   73344   1852 1504256    0     0   256    20  240   149  0 26 25 49
>  3  3  75280   73344   1852 1504268    0     0     0     0  265   140  0 29 13 58
>  3  3  75280   73468   1852 1504232    0     0     0     0  265   132  0 22 34 44
>  3  3  75280   73468   1852 1504232    0     0     0     0  339   283  0 25 26 49
>  3  3  75280   73468   1852 1504232    0     0     0     0  362   327  0 25 25 50
>  3  3  75280   73468   1852 1504232    0     0     0     0  317   320  0 26 25 49
>  3  3  75280   73468   1852 1504232    0     0     0     0  361   343  0 26 25 50
>
> and nothing really happens for 150 s, except more and more tasks blocking
> in D state (second column).
>  3  6  75272   73416   1872 1503872    0     0     0     0  445   700  0 25 25 50
>  0  7  75264   67940   1884 1509008   64     0  5056    16  481   876  0 22 23 55
> Then suddenly things get going again:
>  0  2  75104   76808   1892 1502552  352     0 14292 40456  459 14865  0  2 39 59
>  0  1  75104   75704   1900 1503412    0     0   820    32  405   788  1  1 72 27
>  1  0  75104   76512   1904 1505576    0     0  1068  1072  454  1586  8  7 74 11
>
> I guess this 150 s stall is when kernel barfs the "task blocked for more
> than 30 seconds" messages. And from the traces we know that everyone is
> waiting for PageWriteback or page lock during this time. Also James's vmstat
> report shows that IO is stalled when kswapd is spinning. Really strange.
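As a rough way to pick stall periods like the 150 s one out of a long
'vmstat 1' log, something along these lines could be used. It is only a
sketch: the column order is taken from the header quoted above, while the
names parse_vmstat, stalled and max_blocks, and the "blocked tasks with no
block IO" criterion, are assumptions made for illustration and not anything
used in this thread.

```python
# Sketch: flag seconds in `vmstat 1` output where block IO has stopped
# while tasks are still blocked. Column order is assumed to match the
# vmstat header quoted in this mail.

FIELDS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat(text):
    """Return one dict per data row; header and non-numeric lines are skipped."""
    rows = []
    for line in text.splitlines():
        # Drop leading mail quote markers and spaces before splitting.
        parts = line.lstrip("> ").split()
        if len(parts) != len(FIELDS) or not all(p.isdigit() for p in parts):
            continue
        rows.append(dict(zip(FIELDS, map(int, parts))))
    return rows


def stalled(rows, max_blocks=0):
    """Rows with blocked tasks (b > 0) and both bi and bo <= max_blocks."""
    return [
        r for r in rows
        if r["b"] > 0 and r["bi"] <= max_blocks and r["bo"] <= max_blocks
    ]


# First four samples from the "things just stop" excerpt above.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 1 75280 73564 1844 1503848 0 0 43396 81976 627 1061 0 25 42 32
3 3 75280 73344 1852 1504256 0 0 256 20 240 149 0 26 25 49
3 3 75280 73344 1852 1504268 0 0 0 0 265 140 0 29 13 58
3 3 75280 73468 1852 1504232 0 0 0 0 265 132 0 22 34 44
"""

if __name__ == "__main__":
    for r in stalled(parse_vmstat(SAMPLE)):
        print("stalled: b=%d free=%d wa=%d%%" % (r["b"], r["free"], r["wa"]))
```

Raising max_blocks above zero also catches the "almost stalled" seconds
where only a few hundred blocks move.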

Just to add, this machine has a relatively low amount of memory (1GB).
I've re-run the tests today with cgroups disabled and it ran for 47
'copy' cycles, 27 'copy' cycles and then 35 'copy' cycles.

One extra data point: with maxcpus=1 I get a lockup after 2 'copy'
cycles every time, so it's more predictable than the default 4-processor
configuration.

>
> James in the meantime identified that cgroups are somehow involved. Are you
> using systemd by any chance?

No, I'm using upstart.

> Maybe cgroup IO throttling screws us?
>
> Honza
>
> > > > [ 287.088371] INFO: task rs:main Q:Reg:749 blocked for more than 30 seconds.
> > > > [ 287.088374] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 287.088376] rs:main Q:Reg D 0000000000000000 0 749 1 0x00000000
> > > > [ 287.088381] ffff880072c17b68 0000000000000082 ffff880072c17fd8 ffff880072c16000
> > > > [ 287.088392] 0000000000013d00 ffff88003591b178 ffff880072c17fd8 0000000000013d00
> > > > [ 287.088396] ffffffff81a0b020 ffff88003591adc0 ffff88001fffc3e8 ffff88001fc13d00
> > > > [ 287.088400] Call Trace:
> > > > [ 287.088404] [<ffffffff8110c070>] ? sync_page+0x0/0x50
> > > > [ 287.088408] [<ffffffff815c0990>] io_schedule+0x70/0xc0
> > > > [ 287.088411] [<ffffffff8110c0b0>] sync_page+0x40/0x50
> > > > [ 287.088414] [<ffffffff815c130f>] __wait_on_bit+0x5f/0x90
> > > > [ 287.088418] [<ffffffff8110c278>] wait_on_page_bit+0x78/0x80
> > > > [ 287.088421] [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50
> > > > [ 287.088425] [<ffffffff8110dffd>] __lock_page_or_retry+0x3d/0x70
> > > > [ 287.088428] [<ffffffff8110e3c7>] filemap_fault+0x397/0x4a0
> > > > [ 287.088431] [<ffffffff8112d144>] __do_fault+0x54/0x520
> > > > [ 287.088434] [<ffffffff81134a43>] ? unmap_region+0x113/0x170
> > > > [ 287.088437] [<ffffffff812ded90>] ? prio_tree_insert+0x150/0x1c0
> > > > [ 287.088440] [<ffffffff811309da>] handle_pte_fault+0xfa/0x210
> > > > [ 287.088442] [<ffffffff810442a7>] ? pte_alloc_one+0x37/0x50
> > > > [ 287.088446] [<ffffffff815c2cce>] ? _raw_spin_lock+0xe/0x20
> > > > [ 287.088448] [<ffffffff8112de25>] ? __pte_alloc+0xb5/0x100
> > > > [ 287.088451] [<ffffffff81131d5d>] handle_mm_fault+0x16d/0x250
> > > > [ 287.088454] [<ffffffff815c6a47>] do_page_fault+0x1a7/0x540
> > > > [ 287.088457] [<ffffffff81136f85>] ? do_mmap_pgoff+0x335/0x370
> > > > [ 287.088460] [<ffffffff81137127>] ? sys_mmap_pgoff+0x167/0x230
> > > > [ 287.088463] [<ffffffff815c34d5>] page_fault+0x25/0x30
> > > > [ 287.088466] INFO: task NetworkManager:764 blocked for more than 30 seconds.
> > > > [ 287.088468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 287.088470] NetworkManager D 0000000000000002 0 764 1 0x00000000
> > > > [ 287.088473] ffff880074ffbb68 0000000000000082 ffff880074ffbfd8 ffff880074ffa000
> > > > [ 287.088477] 0000000000013d00 ffff880036051a98 ffff880074ffbfd8 0000000000013d00
> > > > [ 287.088481] ffff8801005badc0 ffff8800360516e0 ffff88001ffef128 ffff88001fc53d00
> > > > [ 287.088484] Call Trace:
> > > > [ 287.088488] [<ffffffff8110c070>] ? sync_page+0x0/0x50
> > > > [ 287.088491] [<ffffffff815c0990>] io_schedule+0x70/0xc0
> > > > [ 287.088494] [<ffffffff8110c0b0>] sync_page+0x40/0x50
> > > > [ 287.088497] [<ffffffff815c130f>] __wait_on_bit+0x5f/0x90
> > > > [ 287.088500] [<ffffffff8110c278>] wait_on_page_bit+0x78/0x80
> > > > [ 287.088503] [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50
> > > > [ 287.088506] [<ffffffff8110dffd>] __lock_page_or_retry+0x3d/0x70
> > > > [ 287.088509] [<ffffffff8110e3c7>] filemap_fault+0x397/0x4a0
> > > > [ 287.088513] [<ffffffff81177110>] ? pollwake+0x0/0x60
> > > > [ 287.088516] [<ffffffff8112d144>] __do_fault+0x54/0x520
> > > > [ 287.088519] [<ffffffff81177110>] ? pollwake+0x0/0x60
> > > > [ 287.088522] [<ffffffff811309da>] handle_pte_fault+0xfa/0x210
> > > > [ 287.088525] [<ffffffff8111561d>] ? __free_pages+0x2d/0x40
> > > > [ 287.088527] [<ffffffff8112de4f>] ? __pte_alloc+0xdf/0x100
> > > > [ 287.088530] [<ffffffff81131d5d>] handle_mm_fault+0x16d/0x250
> > > > [ 287.088533] [<ffffffff815c6a47>] do_page_fault+0x1a7/0x540
> > > > [ 287.088537] [<ffffffff81013859>] ? read_tsc+0x9/0x20
> > > > [ 287.088540] [<ffffffff81092eb1>] ? ktime_get_ts+0xb1/0xf0
> > > > [ 287.088543] [<ffffffff811776d2>] ? poll_select_set_timeout+0x82/0x90
> > > > [ 287.088546] [<ffffffff815c34d5>] page_fault+0x25/0x30
> > > > [ 287.088559] INFO: task unity-panel-ser:1521 blocked for more than 30 seconds.
> > > > [ 287.088561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 287.088562] unity-panel-ser D 0000000000000000 0 1521 1 0x00000000
> > > > [ 287.088566] ffff880061f37b68 0000000000000082 ffff880061f37fd8 ffff880061f36000
> > > > [ 287.088570] 0000000000013d00 ffff880068c7c858 ffff880061f37fd8 0000000000013d00
> > > > [ 287.088573] ffff88003591c4a0 ffff880068c7c4a0 ffff88001fff0c88 ffff88001fc13d00
> > > > [ 287.088577] Call Trace:
> > > > [ 287.088581] [<ffffffff8110c070>] ? sync_page+0x0/0x50
> > > > [ 287.088583] [<ffffffff815c0990>] io_schedule+0x70/0xc0
> > > > [ 287.088587] [<ffffffff8110c0b0>] sync_page+0x40/0x50
> > > > [ 287.088589] [<ffffffff815c130f>] __wait_on_bit+0x5f/0x90
> > > > [ 287.088593] [<ffffffff8110c278>] wait_on_page_bit+0x78/0x80
> > > > [ 287.088596] [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50
> > > > [ 287.088599] [<ffffffff8110dffd>] __lock_page_or_retry+0x3d/0x70
> > > > [ 287.088602] [<ffffffff8110e3c7>] filemap_fault+0x397/0x4a0
> > > > [ 287.088605] [<ffffffff8112d144>] __do_fault+0x54/0x520
> > > > [ 287.088608] [<ffffffff811309da>] handle_pte_fault+0xfa/0x210
> > > > [ 287.088610] [<ffffffff8111561d>] ? __free_pages+0x2d/0x40
> > > > [ 287.088613] [<ffffffff8112de4f>] ? __pte_alloc+0xdf/0x100
> > > > [ 287.088616] [<ffffffff81131d5d>] handle_mm_fault+0x16d/0x250
> > > > [ 287.088619] [<ffffffff815c6a47>] do_page_fault+0x1a7/0x540
> > > > [ 287.088622] [<ffffffff81136f85>] ? do_mmap_pgoff+0x335/0x370
> > > > [ 287.088625] [<ffffffff815c34d5>] page_fault+0x25/0x30
> > > > [ 287.088629] INFO: task jbd2/sda4-8:1845 blocked for more than 30 seconds.
> > > > [ 287.088630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 287.088632] jbd2/sda4-8 D 0000000000000000 0 1845 2 0x00000000
> > > > [ 287.088636] ffff880068f6baf0 0000000000000046 ffff880068f6bfd8 ffff880068f6a000
> > > > [ 287.088639] 0000000000013d00 ffff880061d603b8 ffff880068f6bfd8 0000000000013d00
> > > > [ 287.088643] ffff88003591c4a0 ffff880061d60000 ffff88001fff8548 ffff88001fc13d00
> > > > [ 287.088647] Call Trace:
> > > > [ 287.088650] [<ffffffff8110c070>] ? sync_page+0x0/0x50
> > > > [ 287.088653] [<ffffffff815c0990>] io_schedule+0x70/0xc0
> > > > [ 287.088656] [<ffffffff8110c0b0>] sync_page+0x40/0x50
> > > > [ 287.088659] [<ffffffff815c130f>] __wait_on_bit+0x5f/0x90
> > > > [ 287.088662] [<ffffffff8110c278>] wait_on_page_bit+0x78/0x80
> > > > [ 287.088665] [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50
> > > > [ 287.088668] [<ffffffff8110c41d>] filemap_fdatawait_range+0xfd/0x190
> > > > [ 287.088672] [<ffffffff8110c4db>] filemap_fdatawait+0x2b/0x30
> > > > [ 287.088675] [<ffffffff81242a93>] journal_finish_inode_data_buffers+0x63/0x170
> > > > [ 287.088678] [<ffffffff81243284>] jbd2_journal_commit_transaction+0x6e4/0x1190
> > > > [ 287.088682] [<ffffffff81076185>] ? try_to_del_timer_sync+0x85/0xe0
> > > > [ 287.088685] [<ffffffff81247e9b>] kjournald2+0xbb/0x220
> > > > [ 287.088688] [<ffffffff81087f30>] ? autoremove_wake_function+0x0/0x40
> > > > [ 287.088691] [<ffffffff81247de0>] ? kjournald2+0x0/0x220
> > > > [ 287.088694] [<ffffffff810877e6>] kthread+0x96/0xa0
> > > > [ 287.088697] [<ffffffff8100ce24>] kernel_thread_helper+0x4/0x10
> > > > [ 287.088700] [<ffffffff81087750>] ? kthread+0x0/0xa0
> > > > [ 287.088703] [<ffffffff8100ce20>] ? kernel_thread_helper+0x0/0x10
> > > > [ 287.088705] INFO: task dirname:5969 blocked for more than 30 seconds.
> > > > [ 287.088707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 287.088709] dirname D 0000000000000002 0 5969 5214 0x00000000
> > > > [ 287.088712] ffff88005bd9d8b8 0000000000000086 ffff88005bd9dfd8 ffff88005bd9c000
> > > > [ 287.088716] 0000000000013d00 ffff88005d65b178 ffff88005bd9dfd8 0000000000013d00
> > > > [ 287.088720] ffff8801005e5b80 ffff88005d65adc0 ffff88001ffe5228 ffff88001fc53d00
> > > > [ 287.088723] Call Trace:
> > > > [ 287.088726] [<ffffffff8110c070>] ? sync_page+0x0/0x50
> > > > [ 287.088729] [<ffffffff815c0990>] io_schedule+0x70/0xc0
> > > > [ 287.088732] [<ffffffff8110c0b0>] sync_page+0x40/0x50
> > > > [ 287.088735] [<ffffffff815c130f>] __wait_on_bit+0x5f/0x90
> > > > [ 287.088738] [<ffffffff8110c278>] wait_on_page_bit+0x78/0x80
> > > > [ 287.088741] [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50
> > > > [ 287.088744] [<ffffffff8110dffd>] __lock_page_or_retry+0x3d/0x70
> > > > [ 287.088747] [<ffffffff8110e3c7>] filemap_fault+0x397/0x4a0
> > > > [ 287.088750] [<ffffffff8112d144>] __do_fault+0x54/0x520
> > > > [ 287.088753] [<ffffffff811309da>] handle_pte_fault+0xfa/0x210
> > > > [ 287.088756] [<ffffffff810442a7>] ? pte_alloc_one+0x37/0x50
> > > > [ 287.088759] [<ffffffff815c2cce>] ? _raw_spin_lock+0xe/0x20
> > > > [ 287.088761] [<ffffffff8112de25>] ? __pte_alloc+0xb5/0x100
> > > > [ 287.088764] [<ffffffff81131d5d>] handle_mm_fault+0x16d/0x250
> > > > [ 287.088767] [<ffffffff815c6a47>] do_page_fault+0x1a7/0x540
> > > > [ 287.088770] [<ffffffff81136947>] ? mmap_region+0x1f7/0x500
> > > > [ 287.088773] [<ffffffff8112db06>] ? free_pgd_range+0x356/0x4a0
> > > > [ 287.088776] [<ffffffff815c34d5>] page_fault+0x25/0x30
> > > > [ 287.088779] [<ffffffff812e6d5f>] ? __clear_user+0x3f/0x70
> > > > [ 287.088782] [<ffffffff812e6d41>] ? __clear_user+0x21/0x70
> > > > [ 287.088786] [<ffffffff812e6dc6>] clear_user+0x36/0x40
> > > > [ 287.088788] [<ffffffff811b0b6d>] padzero+0x2d/0x40
> > > > [ 287.088791] [<ffffffff811b2c7a>] load_elf_binary+0x95a/0xe00
> > > > [ 287.088794] [<ffffffff8116aa8a>] search_binary_handler+0xda/0x300
> > > > [ 287.088797] [<ffffffff811b2320>] ? load_elf_binary+0x0/0xe00
> > > > [ 287.088800] [<ffffffff8116c49c>] do_execve+0x24c/0x2d0
> > > > [ 287.088802] [<ffffffff8101521a>] sys_execve+0x4a/0x80
> > > > [ 287.088805] [<ffffffff8100c45c>] stub_execve+0x6c/0xc0
> > > > --
> > > > Jan Kara <jack@xxxxxxx>
> > > > SUSE Labs, CR
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html