RE: [PATCH v2] kasan: fix deadlock in start_report()

From: 袁帅(Shuai Yuan)
Date: Wed Feb 15 2023 - 08:23:09 EST


> On Friday, February 10, 2023 at 6:54 AM Andrey Konovalov <andreyknvl@xxxxxxxxx> wrote:
> > > On Thu, Feb 9, 2023 at 11:44 AM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, 9 Feb 2023 at 10:19, 袁帅(Shuai Yuan) <yuanshuai@xxxxxxxx> wrote:
> > > >
> > > > Hi Dmitry Vyukov
> > > >
> > > > > Thanks, I see what you mean.
> > > > >
> > > > > Currently, report_suppressed() does not seem to work in KASAN HW_TAGS
> > > > > mode; it always returns false.
> > > > > Do you think we should change the report_suppressed() function?
> > > > > I don't know why CONFIG_KASAN_HW_TAGS was excluded separately before.
> > >
> > > That logic was added by Andrey in:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c068664c97c7cf
> > >
> > > Andrey, can we make report_enabled() check current->kasan_depth and
> > > remove report_suppressed()?
> >
> > I decided to not use kasan_depth for HW_TAGS, as we can always use a
> > match-all tag to make "invalid" memory accesses.
> >
> > I think we can fix the reporting code to do exactly that so that it
> > doesn't cause MTE faults.
> >
> > > Shuai, could you clarify at which point during kasan_report_invalid_free()
> > > an MTE exception is raised in your tests?
>
> Yes, I need some time to run a test and capture a clear log of this problem.
>

Hi Andrey and Dmitry,

I now have enough information to clarify the problem and a possible solution.
I made a few changes to the code to collect it.

a) I was testing on a device that had hardware issues with MTE,
and the memory tag sometimes changed randomly.

b) I ran this test on kernel 5.15, but judging from the code, the problem
should also exist in the latest kernel.

c) The kernel was run on a single core with "maxcpus=1".

d) Code modifications:
(1) Call dump_stack_lvl(KERN_ERR) when start_report() returns false.
This is done on top of the current patch v2.

(2) Add logging to print_address_description() to show the kmem_cache address
and its memory tag.
https://elixir.bootlin.com/linux/v5.15.94/source/mm/kasan/report.c#L252
@@ -255,24 +260,25 @@ static void print_address_description(void *addr, u8 tag)

 	dump_stack_lvl(KERN_ERR);
 	pr_err("\n");
-
+	pr_err("ys:1\n");
 	if (page && PageSlab(page)) {
 		struct kmem_cache *cache = page->slab_cache;
-		void *object = nearest_obj(cache, page,addr);
+		void *object = NULL;
+		pr_err("ys:cache start %llx, mtag:%x, page_address:%llx\n",
+		       cache, hw_get_mem_tag(cache), page_address(page));
+		object = nearest_obj(cache, page, addr);
+		pr_err("ys:cache end %llx, object %llx, page_address:%llx\n",
+		       cache, object, page_address(page));
 		describe_object(cache, object, addr, tag);
 	}

(3) Add kasan_enable_tagging() to KUNIT_EXPECT_KASAN_FAIL in
https://elixir.bootlin.com/linux/v5.15.94/source/lib/test_kasan.c#L94
This ensures that the KUnit tests still run on this unstable device.

e) With the above modifications, we get the following backtrace:
ys:1
ys:cache start f4ffff8140005380, mtag:fe, page_address:ffffff8140328000
ys:cache change f4ffff8140005380, mtag:fe, page_address:ffffff8140328000
ys: error address:f4ffff8140005398
Pointer tag: [f4], memory tag: [fe]
CPU: 0 PID: 100 Comm: kunit_try_catch Tainted:
Call trace:
dump_backtrace.cfi_jt+0x0/0x8
show_stack+0x28/0x38
dump_stack_lvl+0x68/0x98
__kasan_report+0x110/0x29c
kasan_report+0x40/0x8c
__do_kernel_fault+0xd4/0x2c4
do_bad_area+0x40/0x100
do_tag_check_fault+0x2c/0x40
do_mem_abort+0x74/0x138
el1_abort+0x40/0x64
el1h_64_sync_handler+0x60/0xa0
el1h_64_sync+0x7c/0x80
print_address_description+0x154/0x2e8
__kasan_report+0x200/0x29c
kasan_report+0x40/0x8c
__do_kernel_fault+0xd4/0x2c4
do_bad_area+0x40/0x100
do_tag_check_fault+0x2c/0x40
do_mem_abort+0x74/0x138
el1_abort+0x40/0x64
el1h_64_sync_handler+0x60/0xa0
el1h_64_sync+0x7c/0x80
enqueue_entity+0x23c/0x4b8
enqueue_task_fair+0x13c/0x48c
enqueue_task.llvm.1684042887774774428+0xd0/0x250
__do_set_cpus_allowed+0x1ac/0x304
__set_cpus_allowed_ptr_locked+0x168/0x28c
migrate_enable+0xf0/0x17c
kasan_strings+0x59c/0x72c
kunit_try_run_case+0x84/0x128
kunit_generic_run_threadfn_adapter+0x48/0x80
kthread+0x17c/0x1e8
ret_from_fork+0x10/0x20
ys:cache end f4ffff8140005380, object ffffff814032ca00, page_address:ffffff8140328000

f) From the above log, you can see that kasan_report() was entered twice. The
second, nested report happened because print_address_description() accessed the
kmem_cache through its tagged address, and the memory tag there had changed
(pointer tag f4, memory tag fe). Normally this is unlikely to happen. So I think
we can apply kasan_reset_tag() to the kmem_cache address.

For example, the following change applies to the latest kernel version:
diff --git a/mm/kasan/report.c b/mm/kasan/report.c
--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -412,7 +412,7 @@ static void complete_report_info(struct kasan_report_info *info)
 	slab = kasan_addr_to_slab(addr);
 	if (slab) {
-		info->cache = slab->slab_cache;
+		info->cache = kasan_reset_tag(slab->slab_cache);
 		info->object = nearest_obj(info->cache, slab, addr);

I have tested a similar change on kernel 5.15 and it seems to work.
There may well be other solutions, so I would appreciate your feedback.
Thanks a lot.

> > > Then we can also remove the comment in kasan_report_invalid_free().
> > >
> > > It looks like kasan_disable_current() in kmemleak needs to affect
> > > HW_TAGS mode as well:
> > > https://elixir.bootlin.com/linux/v6.2-rc7/source/mm/kmemleak.c#L301
> >
> > It uses kasan_reset_tag, so it should work properly with HW_TAGS.