Re: kernel BUG in ext4_writepages

From: Tadeusz Struk
Date: Fri May 20 2022 - 10:50:31 EST


On 5/20/22 02:50, Jan Kara wrote:
On Thu 19-05-22 16:14:17, Tadeusz Struk wrote:
On 5/19/22 05:23, Jan Kara wrote:
Hi!

On Tue 10-05-22 15:28:38, Tadeusz Struk wrote:
Syzbot found another BUG in ext4_writepages [1].
This time it complains about inode with inline data.
C reproducer can be found here [2]
I was able to trigger it on 5.18.0-rc6

[1] https://syzkaller.appspot.com/bug?id=a1e89d09bbbcbd5c4cb45db230ee28c822953984
[2] https://syzkaller.appspot.com/text?tag=ReproC&x=129da6caf00000

Thanks for the report. This should be fixed by:

https://lore.kernel.org/all/20220516012752.17241-1-yebin10@xxxxxxxxxx/


In case of the syzbot bug there is something messed up with PAGE DIRTY flags
and the way syzbot sets up the write. This is what triggers the crash:

Can you tell me where exactly we hit the bug? I've now noticed that this is
on a 5.10 kernel, and on vanilla 5.10 there's no BUG_ON on line 2753.

We are hitting this bug:
https://elixir.bootlin.com/linux/latest/source/fs/ext4/inode.c#L2707
Syzbot found it in v5.10, but I recreated it on 5.18-rc7, which is why
the line numbers don't match. But this is the same bug.
On 5.10 it's on line 2739:
https://elixir.bootlin.com/linux/v5.10.117/source/fs/ext4/inode.c#L2739


$ strace -f ./repro
...
[pid 2395] open("./bus", O_RDWR|O_CREAT|O_SYNC|O_NOATIME, 000 <unfinished ...>
[pid 2395] <... open resumed> ) = 6
...
[pid 2395] write(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 22 <unfinished ...>
...
[pid 2395] <... write resumed> ) = 22
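
For readers who want to see the traced calls in isolation, the open and write above correspond roughly to the C below. This is only a sketch of the two syscalls visible in the trace, not the syzbot reproducer [2], which does considerably more; it is not claimed to trigger the BUG on its own. The function name replay_traced_write() is made up for illustration.

```c
#define _GNU_SOURCE		/* needed for O_NOATIME */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: replays only the two syscalls shown in the trace,
 * an O_SYNC open followed by a 22-byte write of zeroes.
 * Returns the number of bytes written, or -1 on failure.
 */
static ssize_t replay_traced_write(const char *path)
{
	char buf[22];
	ssize_t n;
	int fd;

	memset(buf, 0, sizeof(buf));

	/* Flags and mode 000 are copied from the trace. */
	fd = open(path, O_RDWR | O_CREAT | O_SYNC | O_NOATIME, 0);
	if (fd < 0)
		return -1;

	/* With O_SYNC the write returns only after the data is written back. */
	n = write(fd, buf, sizeof(buf));
	close(fd);
	return n;
}
```

Calling replay_traced_write("./bus") in a directory where ./bus does not yet exist should return 22, matching the trace. Because the file is created with mode 000, a second run as a non-root user is expected to fail at open() unless the file is removed first.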

One way I could fix it was to clear the PAGECACHE_TAG_DIRTY on the mapping in
ext4_try_to_write_inline_data() after the page has been updated:

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 9c076262770d..e4bbb53fa26f 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -715,6 +715,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
 			put_page(page);
 			goto out_up_read;
 		}
+		__xa_clear_mark(&mapping->i_pages, 0, PAGECACHE_TAG_DIRTY);
 	}
 	ret = 1;

Please let me know if it makes sense and I will send a proper patch.

No, this looks really wrong... We need to better understand what's going
on.

That's what I was afraid of. I'm trying to divert ext4_writepages() to the
out_writepages path before we hit this BUG_ON().
Any hints will be much appreciated.

--
Thanks,
Tadeusz