Re: [PATCH v2] f2fs: fix to avoid NULL pointer dereference on se->discard_map

From: Chao Yu
Date: Tue Sep 04 2018 - 21:11:04 EST


On 2018/9/5 0:00, Vicente Bergas wrote:
> On Tue, Sep 4, 2018 at 5:45 PM, Chao Yu <chao@xxxxxxxxxx> wrote:
>> On 2018/9/4 23:25, Vicente Bergas wrote:
>>> On Mon, Sep 3, 2018 at 9:52 PM, Chao Yu <chao@xxxxxxxxxx> wrote:
>>>> From: Chao Yu <yuchao0@xxxxxxxxxx>
>>>>
>>>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D200951&amp;data=02%7C01%7C%7Cc2be7ee866b04268e69f08d6127aa973%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636716715374991374&amp;sdata=E%2Boli5wWbe97f3QCdmAiIIZPEMInd9u221ldtVDqvtA%3D&amp;reserved=0
>>>>
>>>> These is a NULL pointer dereference issue reported in bugzilla:
>>>>
>>>> Hi,
>>>> in the setup there is a SATA SSD connected to a SATA-to-USB bridge.
>>>>
>>>> The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
>>>> There are four partitions:
>>>> sda1: FAT /boot
>>>> sda2: F2FS /
>>>> sda3: F2FS /home
>>>> sda4: F2FS
>>>>
>>>> The bridge is ASMT1153e which uses the "uas" driver.
>>>> There is no TRIM pass-through, so, when mounting it reports:
>>>> mounting with "discard" option, but the device does not support discard
>>>>
>>>> The USB host is USB3.0 and UASP capable. It is the one on RK3399.
>>>>
>>>> Given this everything works fine, except there is no TRIM support.
>>>>
>>>> In order to enable TRIM a new UDEV rule is added [1]:
>>>> /etc/udev/rules.d/10-sata-bridge-trim.rules:
>>>> ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
>>>> After reboot any F2FS write hangs forever and dmesg reports:
>>>> Unable to handle kernel NULL pointer dereference
>>>>
>>>> Also tested on a x86_64 system: works fine even with TRIM enabled.
>>>> same disc
>>>> same bridge
>>>> different usb host controller
>>>> different cpu architecture
>>>> not root filesystem
>>>>
>>>> Regards,
>>>> VicenÃ.
>>>>
>>>> [1] Post #5 in https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbbs.archlinux.org%2Fviewtopic.php%3Fid%3D236280&amp;data=02%7C01%7C%7Cc2be7ee866b04268e69f08d6127aa973%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636716715374991374&amp;sdata=tLP2J%2BL2MPDnqbLm1JcmJ7HfM%2F9j%2F0xc2MET2QSAjVE%3D&amp;reserved=0
>>>>
>>>> Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
>>>> Mem abort info:
>>>> ESR = 0x96000004
>>>> Exception class = DABT (current EL), IL = 32 bits
>>>> SET = 0, FnV = 0
>>>> EA = 0, S1PTW = 0
>>>> Data abort info:
>>>> ISV = 0, ISS = 0x00000004
>>>> CM = 0, WnR = 0
>>>> user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
>>>> [000000000000003e] pgd=0000000000000000
>>>> Internal error: Oops: 96000004 [#1] SMP
>>>> Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
>>>> CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
>>>> Hardware name: Sapphire-RK3399 Board (DT)
>>>> pstate: 00000005 (nzcv daif -PAN -UAO)
>>>> pc : update_sit_entry+0x304/0x4b0
>>>> lr : update_sit_entry+0x108/0x4b0
>>>> sp : ffff00000ca13bd0
>>>> x29: ffff00000ca13bd0 x28: 000000000000003e
>>>> x27: 0000000000000020 x26: 0000000000080000
>>>> x25: 0000000000000048 x24: ffff8000ebb85cf8
>>>> x23: 0000000000000253 x22: 00000000ffffffff
>>>> x21: 00000000000535f2 x20: 00000000ffffffdf
>>>> x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
>>>> x17: 0000000007ce6926 x16: 000000001c83ffa8
>>>> x15: 0000000000000000 x14: ffff8000f602df90
>>>> x13: 0000000000000006 x12: 0000000000000040
>>>> x11: 0000000000000228 x10: 0000000000000000
>>>> x9 : 0000000000000000 x8 : 0000000000000000
>>>> x7 : 00000000000535f2 x6 : ffff8000ebff3440
>>>> x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
>>>> x3 : 00000000ffffffff x2 : 0000000000000020
>>>> x1 : 0000000000000000 x0 : ffff8000eb9e5800
>>>> Process nvim (pid: 957, stack limit = 0x0000000063a78320)
>>>> Call trace:
>>>> update_sit_entry+0x304/0x4b0
>>>> f2fs_invalidate_blocks+0x98/0x140
>>>> truncate_node+0x90/0x400
>>>> f2fs_remove_inode_page+0xe8/0x340
>>>> f2fs_evict_inode+0x2b0/0x408
>>>> evict+0xe0/0x1e0
>>>> iput+0x160/0x260
>>>> do_unlinkat+0x214/0x298
>>>> __arm64_sys_unlinkat+0x3c/0x68
>>>> el0_svc_handler+0x94/0x118
>>>> el0_svc+0x8/0xc
>>>> Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
>>>> ---[ end trace a0f21a307118c477 ]---
>>>>
>>>> The reason is it is possible to enable discard flag on block queue via
>>>> UDEV, but during mount, f2fs will initialize se->discard_map only if
>>>> this flag is set, once the flag is set after mount, f2fs may dereference
>>>> NULL pointer on se->discard_map.
>>>>
>>>> So this patch does below changes to fix this issue:
>>>> - initialize and update se->discard_map all the time.
>>>> - don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
>>>> during mount.
>>>> - don't issue small discard on zoned block device.
>>>> - introduce some functions to enhance the readability.
>>>>
>>>> Signed-off-by: Chao Yu <yuchao0@xxxxxxxxxx>
>>>> ---
>>>> v2:
>>>> - rebase to last dev branch.
>>>> fs/f2fs/debug.c | 3 +--
>>>> fs/f2fs/f2fs.h | 15 ++++++++---
>>>> fs/f2fs/file.c | 2 +-
>>>> fs/f2fs/segment.c | 65 ++++++++++++++++++++---------------------------
>>>> fs/f2fs/super.c | 16 +++---------
>>>> 5 files changed, 46 insertions(+), 55 deletions(-)
>>>>
>>>> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
>>>> index 214a968962a1..ebe649d9793c 100644
>>>> --- a/fs/f2fs/debug.c
>>>> +++ b/fs/f2fs/debug.c
>>>> @@ -190,8 +190,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi)
>>>> si->base_mem += MAIN_SEGS(sbi) * sizeof(struct seg_entry);
>>>> si->base_mem += f2fs_bitmap_size(MAIN_SEGS(sbi));
>>>> si->base_mem += 2 * SIT_VBLOCK_MAP_SIZE * MAIN_SEGS(sbi);
>>>> - if (f2fs_discard_en(sbi))
>>>> - si->base_mem += SIT_VBLOCK_MAP_SIZE * MAIN_SEGS(sbi);
>>>> + si->base_mem += SIT_VBLOCK_MAP_SIZE * MAIN_SEGS(sbi);
>>>> si->base_mem += SIT_VBLOCK_MAP_SIZE;
>>>> if (sbi->segs_per_sec > 1)
>>>> si->base_mem += MAIN_SECS(sbi) * sizeof(struct sec_entry);
>>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>>> index 96bde026636f..b575ec75ef87 100644
>>>> --- a/fs/f2fs/f2fs.h
>>>> +++ b/fs/f2fs/f2fs.h
>>>> @@ -3399,11 +3399,20 @@ static inline int get_blkz_type(struct f2fs_sb_info *sbi,
>>>> }
>>>> #endif
>>>>
>>>> -static inline bool f2fs_discard_en(struct f2fs_sb_info *sbi)
>>>> +static inline bool f2fs_hw_should_discard(struct f2fs_sb_info *sbi)
>>>> {
>>>> - struct request_queue *q = bdev_get_queue(sbi->sb->s_bdev);
>>>> + return f2fs_sb_has_blkzoned(sbi->sb);
>>>> +}
>>>>
>>>> - return blk_queue_discard(q) || f2fs_sb_has_blkzoned(sbi->sb);
>>>> +static inline bool f2fs_hw_support_discard(struct f2fs_sb_info *sbi)
>>>> +{
>>>> + return blk_queue_discard(bdev_get_queue(sbi->sb->s_bdev));
>>>> +}
>>>> +
>>>> +static inline bool f2fs_realtime_discard_enable(struct f2fs_sb_info *sbi)
>>>> +{
>>>> + return (test_opt(sbi, DISCARD) && f2fs_hw_support_discard(sbi)) ||
>>>> + f2fs_hw_should_discard(sbi);
>>>> }
>>>>
>>>> static inline void set_opt_mode(struct f2fs_sb_info *sbi, unsigned int mt)
>>>> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
>>>> index 5474aaa274b9..ca66b3eb197a 100644
>>>> --- a/fs/f2fs/file.c
>>>> +++ b/fs/f2fs/file.c
>>>> @@ -1978,7 +1978,7 @@ static int f2fs_ioc_fitrim(struct file *filp, unsigned long arg)
>>>> if (!capable(CAP_SYS_ADMIN))
>>>> return -EPERM;
>>>>
>>>> - if (!blk_queue_discard(q))
>>>> + if (!f2fs_hw_support_discard(F2FS_SB(sb)))
>>>> return -EOPNOTSUPP;
>>>>
>>>> if (copy_from_user(&range, (struct fstrim_range __user *)arg,
>>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>>>> index 30779aaa9dba..c372ff755818 100644
>>>> --- a/fs/f2fs/segment.c
>>>> +++ b/fs/f2fs/segment.c
>>>> @@ -1725,11 +1725,11 @@ static bool add_discard_addrs(struct f2fs_sb_info *sbi, struct cp_control *cpc,
>>>> struct list_head *head = &SM_I(sbi)->dcc_info->entry_list;
>>>> int i;
>>>>
>>>> - if (se->valid_blocks == max_blocks || !f2fs_discard_en(sbi))
>>>> + if (se->valid_blocks == max_blocks || !f2fs_hw_support_discard(sbi))
>>>> return false;
>>>>
>>>> if (!force) {
>>>> - if (!test_opt(sbi, DISCARD) || !se->valid_blocks ||
>>>> + if (!f2fs_realtime_discard_enable(sbi) || !se->valid_blocks ||
>>>> SM_I(sbi)->dcc_info->nr_discards >=
>>>> SM_I(sbi)->dcc_info->max_discards)
>>>> return false;
>>>> @@ -1835,7 +1835,7 @@ void f2fs_clear_prefree_segments(struct f2fs_sb_info *sbi,
>>>> dirty_i->nr_dirty[PRE]--;
>>>> }
>>>>
>>>> - if (!test_opt(sbi, DISCARD))
>>>> + if (!f2fs_realtime_discard_enable(sbi))
>>>> continue;
>>>>
>>>> if (force && start >= cpc->trim_start &&
>>>> @@ -2025,8 +2025,7 @@ static void update_sit_entry(struct f2fs_sb_info *sbi, block_t blkaddr, int del)
>>>> del = 0;
>>>> }
>>>>
>>>> - if (f2fs_discard_en(sbi) &&
>>>> - !f2fs_test_and_set_bit(offset, se->discard_map))
>>>> + if (!f2fs_test_and_set_bit(offset, se->discard_map))
>>>> sbi->discard_blks--;
>>>>
>>>> /* don't overwrite by SSR to keep node chain */
>>>> @@ -2054,8 +2053,7 @@ static void update_sit_entry(struct f2fs_sb_info *sbi, block_t blkaddr, int del)
>>>> del = 0;
>>>> }
>>>>
>>>> - if (f2fs_discard_en(sbi) &&
>>>> - f2fs_test_and_clear_bit(offset, se->discard_map))
>>>> + if (f2fs_test_and_clear_bit(offset, se->discard_map))
>>>> sbi->discard_blks++;
>>>> }
>>>> if (!f2fs_test_bit(offset, se->ckpt_valid_map))
>>>> @@ -2671,7 +2669,7 @@ int f2fs_trim_fs(struct f2fs_sb_info *sbi, struct fstrim_range *range)
>>>> * discard option. User configuration looks like using runtime discard
>>>> * or periodic fstrim instead of it.
>>>> */
>>>> - if (test_opt(sbi, DISCARD))
>>>> + if (f2fs_realtime_discard_enable(sbi))
>>>> goto out;
>>>>
>>>> start_block = START_BLOCK(sbi, start_segno);
>>>> @@ -3762,13 +3760,11 @@ static int build_sit_info(struct f2fs_sb_info *sbi)
>>>> return -ENOMEM;
>>>> #endif
>>>>
>>>> - if (f2fs_discard_en(sbi)) {
>>>> - sit_i->sentries[start].discard_map
>>>> - = f2fs_kzalloc(sbi, SIT_VBLOCK_MAP_SIZE,
>>>> - GFP_KERNEL);
>>>> - if (!sit_i->sentries[start].discard_map)
>>>> - return -ENOMEM;
>>>> - }
>>>> + sit_i->sentries[start].discard_map
>>>> + = f2fs_kzalloc(sbi, SIT_VBLOCK_MAP_SIZE,
>>>> + GFP_KERNEL);
>>>> + if (!sit_i->sentries[start].discard_map)
>>>> + return -ENOMEM;
>>>> }
>>>>
>>>> sit_i->tmp_map = f2fs_kzalloc(sbi, SIT_VBLOCK_MAP_SIZE, GFP_KERNEL);
>>>> @@ -3916,18 +3912,16 @@ static int build_sit_entries(struct f2fs_sb_info *sbi)
>>>> total_node_blocks += se->valid_blocks;
>>>>
>>>> /* build discard map only one time */
>>>> - if (f2fs_discard_en(sbi)) {
>>>> - if (is_set_ckpt_flags(sbi, CP_TRIMMED_FLAG)) {
>>>> - memset(se->discard_map, 0xff,
>>>> - SIT_VBLOCK_MAP_SIZE);
>>>> - } else {
>>>> - memcpy(se->discard_map,
>>>> - se->cur_valid_map,
>>>> - SIT_VBLOCK_MAP_SIZE);
>>>> - sbi->discard_blks +=
>>>> - sbi->blocks_per_seg -
>>>> - se->valid_blocks;
>>>> - }
>>>> + if (is_set_ckpt_flags(sbi, CP_TRIMMED_FLAG)) {
>>>> + memset(se->discard_map, 0xff,
>>>> + SIT_VBLOCK_MAP_SIZE);
>>>> + } else {
>>>> + memcpy(se->discard_map,
>>>> + se->cur_valid_map,
>>>> + SIT_VBLOCK_MAP_SIZE);
>>>> + sbi->discard_blks +=
>>>> + sbi->blocks_per_seg -
>>>> + se->valid_blocks;
>>>> }
>>>>
>>>> if (sbi->segs_per_sec > 1)
>>>> @@ -3965,16 +3959,13 @@ static int build_sit_entries(struct f2fs_sb_info *sbi)
>>>> if (IS_NODESEG(se->type))
>>>> total_node_blocks += se->valid_blocks;
>>>>
>>>> - if (f2fs_discard_en(sbi)) {
>>>> - if (is_set_ckpt_flags(sbi, CP_TRIMMED_FLAG)) {
>>>> - memset(se->discard_map, 0xff,
>>>> - SIT_VBLOCK_MAP_SIZE);
>>>> - } else {
>>>> - memcpy(se->discard_map, se->cur_valid_map,
>>>> - SIT_VBLOCK_MAP_SIZE);
>>>> - sbi->discard_blks += old_valid_blocks;
>>>> - sbi->discard_blks -= se->valid_blocks;
>>>> - }
>>>> + if (is_set_ckpt_flags(sbi, CP_TRIMMED_FLAG)) {
>>>> + memset(se->discard_map, 0xff, SIT_VBLOCK_MAP_SIZE);
>>>> + } else {
>>>> + memcpy(se->discard_map, se->cur_valid_map,
>>>> + SIT_VBLOCK_MAP_SIZE);
>>>> + sbi->discard_blks += old_valid_blocks;
>>>> + sbi->discard_blks -= se->valid_blocks;
>>>> }
>>>>
>>>> if (sbi->segs_per_sec > 1) {
>>>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>>>> index 53d70b64fea1..5cad0f7687e9 100644
>>>> --- a/fs/f2fs/super.c
>>>> +++ b/fs/f2fs/super.c
>>>> @@ -360,7 +360,6 @@ static int f2fs_check_quota_options(struct f2fs_sb_info *sbi)
>>>> static int parse_options(struct super_block *sb, char *options)
>>>> {
>>>> struct f2fs_sb_info *sbi = F2FS_SB(sb);
>>>> - struct request_queue *q;
>>>> substring_t args[MAX_OPT_ARGS];
>>>> char *p, *name;
>>>> int arg = 0;
>>>> @@ -415,14 +414,7 @@ static int parse_options(struct super_block *sb, char *options)
>>>> return -EINVAL;
>>>> break;
>>>> case Opt_discard:
>>>> - q = bdev_get_queue(sb->s_bdev);
>>>> - if (blk_queue_discard(q)) {
>>>> - set_opt(sbi, DISCARD);
>>>> - } else if (!f2fs_sb_has_blkzoned(sb)) {
>>>> - f2fs_msg(sb, KERN_WARNING,
>>>> - "mounting with \"discard\" option, but "
>>>> - "the device does not support discard");
>>>> - }
>>>> + set_opt(sbi, DISCARD);
>>>> break;
>>>> case Opt_nodiscard:
>>>> if (f2fs_sb_has_blkzoned(sb)) {
>>>> @@ -1032,7 +1024,8 @@ static void f2fs_put_super(struct super_block *sb)
>>>> /* be sure to wait for any on-going discard commands */
>>>> dropped = f2fs_wait_discard_bios(sbi);
>>>>
>>>> - if (f2fs_discard_en(sbi) && !sbi->discard_blks && !dropped) {
>>>> + if ((f2fs_hw_support_discard(sbi) || f2fs_hw_should_discard(sbi)) &&
>>>> + !sbi->discard_blks && !dropped) {
>>>> struct cp_control cpc = {
>>>> .reason = CP_UMOUNT | CP_TRIMMED,
>>>> };
>>>> @@ -1399,8 +1392,7 @@ static void default_options(struct f2fs_sb_info *sbi)
>>>> set_opt(sbi, NOHEAP);
>>>> sbi->sb->s_flags |= SB_LAZYTIME;
>>>> set_opt(sbi, FLUSH_MERGE);
>>>> - if (blk_queue_discard(bdev_get_queue(sbi->sb->s_bdev)))
>>>> - set_opt(sbi, DISCARD);
>>>> + set_opt(sbi, DISCARD);
>>>> if (f2fs_sb_has_blkzoned(sbi->sb))
>>>> set_opt_mode(sbi, F2FS_MOUNT_LFS);
>>>> else
>>>> --
>>>> 2.18.0
>>>>
>>>
>>> Hi,
>>> just tested and it works fine: no more segfaults.
>>> How can i check that TRIM is indeed being used?
>>
>> You can check discard trace:
>>
>> echo 1 > /sys/kernel/debug/tracing/events/f2fs/f2fs_issue_discard/enable
>> cat /sys/kernel/debug/tracing/trace_pipe
>
> I confirm that there are discarded blocks in two filesystems:
> a.- The root filesystem, which had the discard option applied after mount.
> b.- A non-root filesystem, which had the discard option applied before mount.
> So, complete success!

Thanks for your test. :)

Thanks,

>
> Regards,
> VicenÃ.
>
>>
>> Thanks,
>>
>>>
>>> Tested-by: Vicente Bergas <vicencb@xxxxxxxxx>
>>>
>>> Regards,
>>> VicenÃ.
>>>
>
> .
>