Re: [PATCH] zram: fix umount-reset_store-mount race condition

From: Minchan Kim
Date: Mon Feb 02 2015 - 22:01:46 EST


Hello Sergey,

On Mon, Feb 02, 2015 at 11:08:40PM +0900, Sergey Senozhatsky wrote:
> Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex
> to avoid ->bd_holders race condition:
>
> CPU0 CPU1
> umount /* zram->init_done is true */
> reset_store()
> bdev->bd_holders == 0 mount
> ... zram_make_request()
> zram_reset_device()
>
> However, his solution required some considerable amount of code movement,
> which we can avoid.
>
> Apart from using bdev->bd_mutex in reset_store(), this patch also
> simplifies zram_reset_device().
>
> zram_reset_device() has a bool parameter reset_capacity which tells
> it whether disk capacity and itself disk should be reset. There are
> two zram_reset_device() callers:
> -- zram_exit() passes reset_capacity=false
> -- reset_store() passes reset_capacity=true
>
> So we can move reset_capacity-sensitive work out of zram_reset_device()
> and perform it unconditionally in reset_store(). This also lets us drop
> reset_capacity parameter from zram_reset_device() and pass zram pointer
> only.
>
> Reported-by: Ganesh Mahendran <opensource.ganesh@xxxxxxxxx>
> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>

Looks good to me but as I told you, we need to check processes open
the block device as !FMODE_EXCL.
I can make following warning with below simple script.
In addition, I added msleep(2000) below set_capacity(zram->disk, 0)
after applying your patch to make window huge(Kudos to Ganesh!)

#!/bin/bash

echo $((60<<30)) > /sys/block/zram0/disksize
# Check your kernel turns on CONFIG_SCHED_AUTOGROUP=y
# I have no idea why it doesn't happen without setsid and CONFIG_SCHED_AUTOGROUP=y
# Please let me know it if someone can know the reason.
setsid dd if=/dev/zero of=/dev/zram0 &
sleep 1
setsid echo 1 > /sys/block/zram0/reset

[ 30.683449] zram0: detected capacity change from 0 to 64424509440
[ 33.736869] Buffer I/O error on dev zram0, logical block 180823, lost async page write
[ 33.738814] Buffer I/O error on dev zram0, logical block 180824, lost async page write
[ 33.740654] Buffer I/O error on dev zram0, logical block 180825, lost async page write
[ 33.742551] Buffer I/O error on dev zram0, logical block 180826, lost async page write
[ 33.744153] Buffer I/O error on dev zram0, logical block 180827, lost async page write
[ 33.745807] Buffer I/O error on dev zram0, logical block 180828, lost async page write
[ 33.747419] Buffer I/O error on dev zram0, logical block 180829, lost async page write
[ 33.749060] Buffer I/O error on dev zram0, logical block 180830, lost async page write
[ 33.750687] Buffer I/O error on dev zram0, logical block 180831, lost async page write
[ 33.752286] Buffer I/O error on dev zram0, logical block 180832, lost async page write
[ 33.811590] ------------[ cut here ]------------
[ 33.812038] WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
[ 33.812817] Modules linked in:
[ 33.813142] CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
[ 33.813837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 33.814525] ffffffff81801a2d ffff880061e77db8 ffffffff815b848e 0000000000000001
[ 33.815196] 0000000000000000 ffff880061e77df8 ffffffff8104de2a 0000000000000000
[ 33.815867] ffff88005da287f0 ffff88005da28680 ffff88005da28770 ffff88005da28698
[ 33.816536] Call Trace:
[ 33.816817] [<ffffffff815b848e>] dump_stack+0x45/0x57
[ 33.817304] [<ffffffff8104de2a>] warn_slowpath_common+0x8a/0xc0
[ 33.817829] [<ffffffff8104df1a>] warn_slowpath_null+0x1a/0x20
[ 33.818331] [<ffffffff811b60b7>] __blkdev_put+0x1d7/0x210
[ 33.818797] [<ffffffff811b69c0>] blkdev_put+0x50/0x130
[ 33.819244] [<ffffffff811b6b55>] blkdev_close+0x25/0x30
[ 33.819723] [<ffffffff8118079f>] __fput+0xdf/0x1e0
[ 33.820140] [<ffffffff811808ee>] ____fput+0xe/0x10
[ 33.820576] [<ffffffff81068e07>] task_work_run+0xa7/0xe0
[ 33.821151] [<ffffffff81002b89>] do_notify_resume+0x49/0x60
[ 33.821721] [<ffffffff815bf09d>] int_signal+0x12/0x17
[ 33.822228] ---[ end trace 274fbbc5664827d2 ]---

The warning comes from bdev_write_node in blkdev_put path.

tatic void bdev_write_inode(struct inode *inode)
{
spin_lock(&inode->i_lock);
while (inode->i_state & I_DIRTY) {
spin_unlock(&inode->i_lock);
WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
spin_lock(&inode->i_lock);
}
spin_unlock(&inode->i_lock);
}

The reason is dd process encounters I/O fails due to sudden block device
disappear so in filemap_check_errors in __writeback_single_inode returns
-EIO.

If we checks bd_openners instead of bd_holders, we could address the
problem. When I see the brd, it already have used it rather than
bd_holders so although I'm not a expert of block layer, it seems
to be better.

> ---
> drivers/block/zram/zram_drv.c | 23 +++++++++--------------
> 1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index aa5a4c5..a32069f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -717,7 +717,7 @@ static void zram_bio_discard(struct zram *zram, u32 index,
> }
> }
>
> -static void zram_reset_device(struct zram *zram, bool reset_capacity)
> +static void zram_reset_device(struct zram *zram)
> {
> down_write(&zram->init_lock);
>
> @@ -736,18 +736,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> memset(&zram->stats, 0, sizeof(zram->stats));
>
> zram->disksize = 0;
> - if (reset_capacity)
> - set_capacity(zram->disk, 0);
> -
> up_write(&zram->init_lock);
> -
> - /*
> - * Revalidate disk out of the init_lock to avoid lockdep splat.
> - * It's okay because disk's capacity is protected by init_lock
> - * so that revalidate_disk always sees up-to-date capacity.
> - */
> - if (reset_capacity)
> - revalidate_disk(zram->disk);
> }
>
> static ssize_t disksize_store(struct device *dev,
> @@ -820,6 +809,7 @@ static ssize_t reset_store(struct device *dev,
> if (!bdev)
> return -ENOMEM;
>
> + mutex_lock(&bdev->bd_mutex);
> /* Do not reset an active device! */
> if (bdev->bd_holders) {
> ret = -EBUSY;
> @@ -837,12 +827,17 @@ static ssize_t reset_store(struct device *dev,
>
> /* Make sure all pending I/O is finished */
> fsync_bdev(bdev);
> + zram_reset_device(zram);
> + set_capacity(zram->disk, 0);
> +
> + mutex_unlock(&bdev->bd_mutex);
> + revalidate_disk(zram->disk);
> bdput(bdev);
>
> - zram_reset_device(zram, true);
> return len;
>
> out:
> + mutex_unlock(&bdev->bd_mutex);
> bdput(bdev);
> return ret;
> }
> @@ -1188,7 +1183,7 @@ static void __exit zram_exit(void)
> * Shouldn't access zram->disk after destroy_device
> * because destroy_device already released zram->disk.
> */
> - zram_reset_device(zram, false);
> + zram_reset_device(zram);
> }
>
> unregister_blkdev(zram_major, "zram");
> --
> 2.3.0.rc2
>

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/