Re: PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" starting from 5.17 on dual socket Intel Xeon Gold servers

From: Minchan Kim
Date: Mon Apr 04 2022 - 17:27:05 EST


On Fri, Apr 01, 2022 at 02:04:03PM +0200, Jirka Hladky wrote:
> > Could you decode exact source code line from the oops?
>
> Yes - please see below [1].

Thanks.

>
> > I think it's fine to attach in the reply because kernel test bot
>
> OK. The reproducer is attached. Please unpack it and follow the
> instructions in the README file. [2]

Unfortunately, I failed to run the script on my machine.

>
> Thanks a lot for looking into it!
> Jirka
>
> [1]
> =============================================
> Source code line numbers for the Oops message
> =============================================
>
> 1) RIP: 0010:kernfs_remove+0x8/0x50:
> (gdb) l *kernfs_remove+0x8
> 0xffffffff81418588 is in kernfs_remove (fs/kernfs/kernfs-internal.h:48).
> 43 * Return the kernfs_root @kn belongs to.
> 44 */
> 45 static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
> 46 {
> 47 /* if parent exists, it's always a dir; otherwise, @sd is a dir */
> 48 if (kn->parent)
> 49 kn = kn->parent;
> 50 return kn->dir.root;
> 51 }
>
> And here are source code lines from the 5 first functions in call trace:
> [ 8563.366280] Call Trace:
> [ 8563.366280] <TASK>
> [ 8563.366280] rdt_kill_sb+0x29d/0x350
> [ 8563.366280] deactivate_locked_super+0x36/0xa0
> [ 8563.366280] cleanup_mnt+0x131/0x190
> [ 8563.366280] task_work_run+0x5c/0x90
> [ 8563.366280] exit_to_user_mode_prepare+0x229/0x230
> [ 8563.366280] syscall_exit_to_user_mode+0x18/0x40
> [ 8563.366280] do_syscall_64+0x48/0x90
> [ 8563.366280] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> 2)(gdb) l *rdt_kill_sb+0x29d
> 0xffffffff810506bd is in rdt_kill_sb (arch/x86/kernel/cpu/resctrl/rdtgroup.c:2442).
> 2437 /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
> 2438 update_closid_rmid(cpu_online_mask, &rdtgroup_default);
> 2439
> 2440 kernfs_remove(kn_info);
> 2441 kernfs_remove(kn_mongrp);
> 2442 kernfs_remove(kn_mondata);
> 2443 }
>
> 3)(gdb) l *deactivate_locked_super+0x36
> 0xffffffff813650f6 is in deactivate_locked_super (fs/super.c:342).
> 337 /*
> 338 * Since list_lru_destroy() may sleep, we cannot call it from
> 339 * put_super(), where we hold the sb_lock. Therefore we destroy
> 340 * the lru lists right now.
> 341 */
> 342 list_lru_destroy(&s->s_dentry_lru);
> 343 list_lru_destroy(&s->s_inode_lru);
> 344
> 345 put_filesystem(fs);
> 346 put_super(s);
>
> 4) (gdb) l *cleanup_mnt+0x131
> 0xffffffff813890a1 is in cleanup_mnt (fs/namespace.c:137).
> 132 return 0;
> 133 }
> 134
> 135 static void mnt_free_id(struct mount *mnt)
> 136 {
> 137 ida_free(&mnt_id_ida, mnt->mnt_id);
> 138 }
>
> 5) (gdb) l *task_work_run+0x5c
> 0xffffffff8110620c is in task_work_run (./include/linux/sched.h:2017).
> 2012
> 2013 DECLARE_STATIC_CALL(cond_resched, __cond_resched);
> 2014
> 2015 static __always_inline int _cond_resched(void)
> 2016 {
> 2017 return static_call_mod(cond_resched)();
> 2018 }
>
> 6) (gdb) l *exit_to_user_mode_prepare+0x229
> 0xffffffff81176d19 is in exit_to_user_mode_prepare (./include/linux/tracehook.h:189).
> 184 * This barrier pairs with task_work_add()->set_notify_resume() after
> 185 * hlist_add_head(task->task_works);
> 186 */
> 187 smp_mb__after_atomic();
> 188 if (unlikely(current->task_works))
> 189 task_work_run();
> 190
> 191 #ifdef CONFIG_KEYS_REQUEST_CACHE
> 192 if (unlikely(current->cached_requested_key)) {
> 193 key_put(current->cached_requested_key);
>
> [2]
> =============================================
> Reproducer - README
> =============================================
>
> 1) HW
> This issue seems to be platform specific. I was not able to reproduce
> it on AMD Zen and also not on Intel Ice Lake platform.
> I see the issue on dual socket Intel Skylake systems. Reproduced on a
> Supermicro Super Server/X11DDW-L with 2x Xeon Gold 6126 CPU.

Based on your report, the kernel crashed because kn_mondata was NULL:

rdt_kill_sb
  rmdir_all_sub
  ..
  kernfs_remove(kn_mondata);
    struct kernfs_root *root = kernfs_root(kn); <-- crashed


Before my patch [1], it worked like this:

rdt_kill_sb
  rmdir_all_sub
  ..
  kernfs_remove(kn_mondata);
    down_write(&kernfs_rwsem);
    if (!kn)
      return;
    up_write(&kernfs_rwsem);

IOW, kernfs_remove() used to accept a NULL argument and simply bail out,
but with my patch [1] it dereferences the node to look up its root before
any NULL check, so a NULL argument now crashes.

This raises a question for the kernfs maintainers:

Should the kernfs_remove() API support a NULL parameter? If so, can we
support it atomically without the old global kernfs_rwsem?
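
To make the question concrete, here is a minimal user-space sketch of one
possible approach: bail out on NULL before looking up the root, so that the
per-root lock is never needed for a NULL node. This is only a model, not
kernel code and not a proposed patch. The *_model types and
remove_node_model() are hypothetical stand-ins invented for illustration;
only the body of root_of() follows the kernfs_root() helper quoted above.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (hypothetical, not the real layout). */
struct kernfs_root_model {
	int id;
};

struct kernfs_node_model {
	struct kernfs_node_model *parent;
	struct kernfs_root_model *root;	/* models kn->dir.root */
};

/* Same logic as the quoted kernfs_root(): dereferences kn unconditionally. */
static struct kernfs_root_model *root_of(struct kernfs_node_model *kn)
{
	if (kn->parent)
		kn = kn->parent;
	return kn->root;
}

/*
 * Hypothetical removal entry point. Returns 0 if kn was NULL and nothing
 * was done, 1 if the removal path would have been entered.
 */
static int remove_node_model(struct kernfs_node_model *kn)
{
	struct kernfs_root_model *root;

	/* Check NULL first, so root_of() is never called on a NULL node. */
	if (!kn)
		return 0;

	root = root_of(kn);
	/* Real code would take the per-root rwsem here and do the removal. */
	(void)root;
	return 1;
}
```

With this ordering, remove_node_model(NULL) returns without touching any
lock, which matches the old "just bail" behaviour. Whether an early return
like this is acceptable for kernfs is exactly the open question above.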

[1] 393c3714081a, kernfs: switch global kernfs_rwsem lock to per-fs lock