Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.

From: Jeff Moyer
Date: Wed Mar 16 2011 - 16:30:50 EST


NeilBrown <neilb@xxxxxxx> writes:

>> Synchronous notification of errors. If we don't try to write everything
>> back immediately after the size change, we don't see dirty pages in
>> zapped regions until the writeout/page cache management takes it into
>> its head to try to clean the pages.
>>
>
> So if you just want synchronous errors, I think you want:
> fsync_bdev()
>
> which calls sync_filesystem() if it can find a filesystem, else
> sync_blockdev(); (sync_filesystem itself calls sync_blockdev too).

... which deadlocks md. ;-) writeback_inodes_sb_nr is waiting for the
flusher thread to write back the dirty data. The flusher thread is
stuck in md_write_start, here:

wait_event(mddev->sb_wait,
!test_bit(MD_CHANGE_PENDING, &mddev->flags));

This is after reverting your change, and replacing the flush_disk call
in check_disk_size_change with a call to fsync_bdev. I'm not familiar
enough with md to really suggest a way forward. Neil?

Cheers,
Jeff

md127: detected capacity change from 267386880 to 401080320
INFO: task md127_raid5:2255 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
md127_raid5 D ffff88011d010690 5416 2255 2 0x00000080
ffff88010fcc7990 0000000000000046 ffff880100000000 ffffffff812070c9
0000000000014d00 ffff88011d010100 ffff88011d010690 ffff88010fcc7fd8
ffff88011d010698 0000000000014d00 ffff88010fcc6010 0000000000014d00
Call Trace:
[<ffffffff812070c9>] ? cpumask_next_and+0x29/0x50
[<ffffffff81493df5>] schedule_timeout+0x265/0x2d0
[<ffffffff8104b341>] ? enqueue_task+0x61/0x80
[<ffffffff81493a25>] wait_for_common+0x115/0x180
[<ffffffff81057850>] ? default_wake_function+0x0/0x10
[<ffffffff81493b38>] wait_for_completion+0x18/0x20
[<ffffffff8115cce2>] writeback_inodes_sb_nr+0x72/0xa0
[<ffffffff8115cfad>] writeback_inodes_sb+0x4d/0x60
[<ffffffff81162499>] __sync_filesystem+0x49/0x90
[<ffffffff81162592>] sync_filesystem+0x32/0x60
[<ffffffff8116bc59>] fsync_bdev+0x29/0x70
[<ffffffff8116bcea>] check_disk_size_change+0x4a/0xb0
[<ffffffff81208e27>] ? kobject_put+0x27/0x60
[<ffffffff8116bdaf>] revalidate_disk+0x5f/0x90
[<ffffffffa031155a>] raid5_finish_reshape+0x9a/0x1e0 [raid456]
[<ffffffff8138a933>] reap_sync_thread+0x63/0x130
[<ffffffff8138c8a6>] md_check_recovery+0x1f6/0x6f0
[<ffffffffa03150ab>] raid5d+0x3b/0x610 [raid456]
[<ffffffff810804c9>] ? prepare_to_wait+0x59/0x90
[<ffffffff81387ee9>] md_thread+0x119/0x150
[<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81387dd0>] ? md_thread+0x0/0x150
[<ffffffff8107fb56>] kthread+0x96/0xa0
[<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fac0>] ? kthread+0x0/0xa0
[<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10
INFO: task flush-9:127:2288 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
flush-9:127 D ffff88011cee30a0 4664 2288 2 0x00000080
ffff88011b0af6a0 0000000000000046 0000000000000000 0000000000000000
0000000000014d00 ffff88011cee2b10 ffff88011cee30a0 ffff88011b0affd8
ffff88011cee30a8 0000000000014d00 ffff88011b0ae010 0000000000014d00
Call Trace:
[<ffffffff8138bbb5>] md_write_start+0xa5/0x1c0
[<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0316435>] make_request+0x45/0x6c0 [raid456]
[<ffffffff811fbfcb>] ? blkiocg_update_dispatch_stats+0x8b/0xd0
[<ffffffff81385ca3>] md_make_request+0xd3/0x210
[<ffffffff811ee9da>] generic_make_request+0x2ea/0x5d0
[<ffffffff810e9cde>] ? mempool_alloc+0x5e/0x140
[<ffffffff811eed41>] submit_bio+0x81/0x110
[<ffffffff811699c6>] ? bio_alloc_bioset+0x56/0xf0
[<ffffffff81163ef6>] submit_bh+0xe6/0x140
[<ffffffff81165ad0>] __block_write_full_page+0x200/0x390
[<ffffffff811655a0>] ? end_buffer_async_write+0x0/0x1a0
[<ffffffff8116667e>] block_write_full_page_endio+0xde/0x110
[<ffffffffa037d3b0>] ? buffer_unmapped+0x0/0x20 [ext3]
[<ffffffff811666c0>] block_write_full_page+0x10/0x20
[<ffffffffa037de6d>] ext3_writeback_writepage+0x11d/0x170 [ext3]
[<ffffffff810f0152>] __writepage+0x12/0x40
[<ffffffff810f12b4>] write_cache_pages+0x1a4/0x490
[<ffffffff810f0140>] ? __writepage+0x0/0x40
[<ffffffff810f15bf>] generic_writepages+0x1f/0x30
[<ffffffff810f15f5>] do_writepages+0x25/0x30
[<ffffffff8115d5f0>] writeback_single_inode+0x90/0x220
[<ffffffff8115d9b6>] writeback_sb_inodes+0xc6/0x170
[<ffffffff8115dd3f>] wb_writeback+0x17f/0x430
[<ffffffff8106e217>] ? lock_timer_base+0x37/0x70
[<ffffffff8115e08d>] wb_do_writeback+0x9d/0x270
[<ffffffff8106e330>] ? process_timeout+0x0/0x10
[<ffffffff8115e302>] bdi_writeback_thread+0xa2/0x280
[<ffffffff8115e260>] ? bdi_writeback_thread+0x0/0x280
[<ffffffff8115e260>] ? bdi_writeback_thread+0x0/0x280
[<ffffffff8107fb56>] kthread+0x96/0xa0
[<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fac0>] ? kthread+0x0/0xa0
[<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10
INFO: task updatedb:2342 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
updatedb D ffff88011bd77af0 5136 2342 2323 0x00000080
ffff88011c877cb8 0000000000000086 0000000000000000 ffff88011c1829a8
0000000000014d00 ffff88011bd77560 ffff88011bd77af0 ffff88011c877fd8
ffff88011bd77af8 0000000000014d00 ffff88011c876010 0000000000014d00
Call Trace:
[<ffffffff81165130>] ? sync_buffer+0x0/0x50
[<ffffffff8149382b>] io_schedule+0x6b/0xb0
[<ffffffff8116516b>] sync_buffer+0x3b/0x50
[<ffffffff81494057>] __wait_on_bit+0x57/0x80
[<ffffffff811699c6>] ? bio_alloc_bioset+0x56/0xf0
[<ffffffff81165130>] ? sync_buffer+0x0/0x50
[<ffffffff814940f3>] out_of_line_wait_on_bit+0x73/0x90
[<ffffffff81080210>] ? wake_bit_function+0x0/0x40
[<ffffffff81165126>] __wait_on_buffer+0x26/0x30
[<ffffffffa038006c>] ext3_bread+0x5c/0x80 [ext3]
[<ffffffffa037ba63>] ext3_readdir+0x1f3/0x600 [ext3]
[<ffffffff8114a650>] ? filldir+0x0/0xe0
[<ffffffff8114a650>] ? filldir+0x0/0xe0
[<ffffffff8114a7e0>] vfs_readdir+0xb0/0xd0
[<ffffffff8114a964>] sys_getdents+0x84/0xf0
[<ffffffff8100bdd2>] system_call_fastpath+0x16/0x1b


diff --git a/block/genhd.c b/block/genhd.c
index cbf1112..6a5b772 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1355,7 +1355,7 @@ int invalidate_partition(struct gendisk *disk, int partno)
struct block_device *bdev = bdget_disk(disk, partno);
if (bdev) {
fsync_bdev(bdev);
- res = __invalidate_device(bdev, true);
+ res = __invalidate_device(bdev);
bdput(bdev);
}
return res;
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 77fc76f..b9ba04f 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3281,7 +3281,7 @@ static int set_geometry(unsigned int cmd, struct floppy_struct *g,
struct block_device *bdev = opened_bdev[cnt];
if (!bdev || ITYPE(drive_state[cnt].fd_device) != type)
continue;
- __invalidate_device(bdev, true);
+ __invalidate_device(bdev);
}
mutex_unlock(&open_lock);
} else {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8892870..5aae241 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -934,9 +934,9 @@ EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);
* when a disk has been changed -- either by a media change or online
* resize.
*/
-static void flush_disk(struct block_device *bdev, bool kill_dirty)
+static void flush_disk(struct block_device *bdev)
{
- if (__invalidate_device(bdev, kill_dirty)) {
+ if (__invalidate_device(bdev)) {
char name[BDEVNAME_SIZE] = "";

if (bdev->bd_disk)
@@ -973,7 +973,7 @@ void check_disk_size_change(struct gendisk *disk, struct block_device *bdev)
"%s: detected capacity change from %lld to %lld\n",
name, bdev_size, disk_size);
i_size_write(bdev->bd_inode, disk_size);
- flush_disk(bdev, false);
+ fsync_bdev(bdev);
}
}
EXPORT_SYMBOL(check_disk_size_change);
@@ -1026,7 +1026,7 @@ int check_disk_change(struct block_device *bdev)
if (!(events & DISK_EVENT_MEDIA_CHANGE))
return 0;

- flush_disk(bdev, true);
+ flush_disk(bdev);
if (bdops->revalidate_disk)
bdops->revalidate_disk(bdev->bd_disk);
return 1;
@@ -1607,7 +1607,7 @@ fail:
}
EXPORT_SYMBOL(lookup_bdev);

-int __invalidate_device(struct block_device *bdev, bool kill_dirty)
+int __invalidate_device(struct block_device *bdev)
{
struct super_block *sb = get_super(bdev);
int res = 0;
@@ -1620,7 +1620,7 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
* hold).
*/
shrink_dcache_sb(sb);
- res = invalidate_inodes(sb, kill_dirty);
+ res = invalidate_inodes(sb);
drop_super(sb);
}
invalidate_bdev(bdev);
diff --git a/fs/inode.c b/fs/inode.c
index 0647d80..9c2b795 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -548,14 +548,11 @@ void evict_inodes(struct super_block *sb)
/**
* invalidate_inodes - attempt to free all inodes on a superblock
* @sb: superblock to operate on
- * @kill_dirty: flag to guide handling of dirty inodes
*
* Attempts to free all inodes for a given superblock. If there were any
* busy inodes return a non-zero value, else zero.
- * If @kill_dirty is set, discard dirty inodes too, otherwise treat
- * them as busy.
*/
-int invalidate_inodes(struct super_block *sb, bool kill_dirty)
+int invalidate_inodes(struct super_block *sb)
{
int busy = 0;
struct inode *inode, *next;
@@ -567,10 +564,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
continue;
- if (inode->i_state & I_DIRTY && !kill_dirty) {
- busy = 1;
- continue;
- }
if (atomic_read(&inode->i_count)) {
busy = 1;
continue;
diff --git a/fs/internal.h b/fs/internal.h
index f3d15de..bee95ea 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -125,4 +125,4 @@ extern long do_handle_open(int mountdirfd,
*/
extern int get_nr_dirty_inodes(void);
extern void evict_inodes(struct super_block *);
-extern int invalidate_inodes(struct super_block *, bool);
+extern int invalidate_inodes(struct super_block *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13df14e..ff9a159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2155,7 +2155,7 @@ extern void check_disk_size_change(struct gendisk *disk,
struct block_device *bdev);
extern int revalidate_disk(struct gendisk *);
extern int check_disk_change(struct block_device *);
-extern int __invalidate_device(struct block_device *, bool);
+extern int __invalidate_device(struct block_device *);
extern int invalidate_partition(struct gendisk *, int);
#endif
unsigned long invalidate_mapping_pages(struct address_space *mapping,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/