Re: [PATCH v7 2/4] kernfs: Change kernfs_notify_list to llist.

From: Marek Szyprowski
Date: Fri Jul 01 2022 - 07:22:20 EST


Hi All,

On 15.06.2022 04:10, Imran Khan wrote:
> At present kernfs_notify_list is implemented as a singly linked
> list of kernfs_node(s), where last element points to itself and
> value of ->attr.next tells if node is present on the list or not.
> Both addition and deletion to list happen under kernfs_notify_lock.
>
> Change kernfs_notify_list to llist so that addition to list can heppen
> locklessly.
>
> Suggested by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
> Acked-by: Tejun Heo <tj@xxxxxxxxxx>

This patch landed in linux next-20220630 as commit b8f35fa1188b
("kernfs: Change kernfs_notify_list to llist."). Unfortunately, it
causes serious regression on my test systems. It can be easily noticed
in the logs by the following warning:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 34 at fs/kernfs/dir.c:531 kernfs_put.part.0+0x1a4/0x1d8
kernfs_put: console/active: released with incorrect active_ref 0
Modules linked in:
CPU: 1 PID: 34 Comm: kworker/1:4 Not tainted
5.19.0-rc4-05465-g5732b42edfd1 #12317
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events kernfs_notify_workfn
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x40/0x4c
 dump_stack_lvl from __warn+0xc8/0x13c
 __warn from warn_slowpath_fmt+0x90/0xb4
 warn_slowpath_fmt from kernfs_put.part.0+0x1a4/0x1d8
 kernfs_put.part.0 from kernfs_notify_workfn+0x1a0/0x1d0
 kernfs_notify_workfn from process_one_work+0x1ec/0x4cc
 process_one_work from worker_thread+0x58/0x54c
 worker_thread from kthread+0xd0/0xec
 kthread from ret_from_fork+0x14/0x2c
Exception stack(0xf099dfb0 to 0xf099dff8)
...
---[ end trace 0000000000000000 ]---

then, after some standard kernel messages, booting stops with the
following log:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu:     1-...!: (2099 ticks this GP) idle=403c/1/0x40000002
softirq=33/33 fqs=1
 (t=2100 jiffies g=-1135 q=4 ncpus=2)
rcu: rcu_sched kthread starved for 2098 jiffies! g-1135 f0x0
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now
expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched       state:R  running task     stack:    0 pid: 10
ppid:     2 flags:0x00000000
 __schedule from schedule+0x48/0xb0
 schedule from schedule_timeout+0x84/0x138
 schedule_timeout from rcu_gp_fqs_loop+0x124/0x478
 rcu_gp_fqs_loop from rcu_gp_kthread+0x150/0x1c8
 rcu_gp_kthread from kthread+0xd0/0xec
 kthread from ret_from_fork+0x14/0x2c
Exception stack(0xf088dfb0 to 0xf088dff8)
...
rcu: Stack dump where RCU GP kthread last ran:
NMI backtrace for cpu 1
CPU: 1 PID: 34 Comm: kworker/1:4 Tainted: G        W
5.19.0-rc4-05465-g5732b42edfd1 #12317
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events kernfs_notify_workfn
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x40/0x4c
 dump_stack_lvl from nmi_cpu_backtrace+0xd0/0x104
 nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xd8/0x120
 nmi_trigger_cpumask_backtrace from
rcu_check_gp_kthread_starvation+0x138/0x154
 rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0x664/0x9f4
 rcu_sched_clock_irq from update_process_times+0x6c/0x94
 update_process_times from tick_sched_timer+0x60/0xe8
 tick_sched_timer from __hrtimer_run_queues+0x1b4/0x328
 __hrtimer_run_queues from hrtimer_interrupt+0x124/0x2b0
 hrtimer_interrupt from exynos4_mct_tick_isr+0x44/0x7c
 exynos4_mct_tick_isr from __handle_irq_event_percpu+0x7c/0x1cc
 __handle_irq_event_percpu from handle_irq_event+0x44/0x8c
 handle_irq_event from handle_fasteoi_irq+0x98/0x18c
 handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34
 generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
 gic_handle_irq from generic_handle_arch_irq+0x34/0x44
 generic_handle_arch_irq from call_with_stack+0x18/0x20
 call_with_stack from __irq_svc+0x98/0xb0
Exception stack(0xf099de78 to 0xf099dec0)
...
 __irq_svc from up_write+0x4/0x3c
 up_write from 0xef7cbb00


It was not easy to find this, because bisecting points to commit
5732b42edfd1 ("Merge branch 'driver-core-next' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git").
This means that there are some non-trivial dependencies between both
merged branches. I've rebased 5732b42edfd1^2 onto 5732b42edfd1^1 and
then I've did the bisecting of the rebased commits. Finally I've found
this one. Then I've confirmed that reverting it on top of 5732b42edfd1
and then next-20220630 really fixes the issue.

Let me know how I can help fixing this issue. I've observed this issue
on ARM 32bit kernel compiled from multi_v7_defconfig and following
boards: Raspberry Pi 3b and 4b as well as Samsung Exynos based Odroid U3
and XU4. This issue doesn't happen on QEMU's ARM32bit 'virt' machine.


> ---
> fs/kernfs/file.c | 47 ++++++++++++++++++------------------------
> include/linux/kernfs.h | 2 +-
> 2 files changed, 21 insertions(+), 28 deletions(-)
>
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 706aebcfb8f69..dce654ad2cea9 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -38,18 +38,16 @@ struct kernfs_open_node {
> struct list_head files; /* goes through kernfs_open_file.list */
> };
>
> -/*
> - * kernfs_notify() may be called from any context and bounces notifications
> - * through a work item. To minimize space overhead in kernfs_node, the
> - * pending queue is implemented as a singly linked list of kernfs_nodes.
> - * The list is terminated with the self pointer so that whether a
> - * kernfs_node is on the list or not can be determined by testing the next
> - * pointer for NULL.
> +/**
> + * attribute_to_node - get kernfs_node object corresponding to a kernfs attribute
> + * @ptr: &struct kernfs_elem_attr
> + * @type: struct kernfs_node
> + * @member: name of member (i.e attr)
> */
> -#define KERNFS_NOTIFY_EOL ((void *)&kernfs_notify_list)
> +#define attribute_to_node(ptr, type, member) \
> + container_of(ptr, type, member)
>
> -static DEFINE_SPINLOCK(kernfs_notify_lock);
> -static struct kernfs_node *kernfs_notify_list = KERNFS_NOTIFY_EOL;
> +static LLIST_HEAD(kernfs_notify_list);
>
> /**
> * kernfs_deref_open_node - Get kernfs_open_node corresponding to @kn.
> @@ -903,18 +901,16 @@ static void kernfs_notify_workfn(struct work_struct *work)
> struct kernfs_node *kn;
> struct kernfs_super_info *info;
> struct kernfs_root *root;
> + struct llist_node *free;
> + struct kernfs_elem_attr *attr;
> repeat:
> /* pop one off the notify_list */
> - spin_lock_irq(&kernfs_notify_lock);
> - kn = kernfs_notify_list;
> - if (kn == KERNFS_NOTIFY_EOL) {
> - spin_unlock_irq(&kernfs_notify_lock);
> + free = llist_del_first(&kernfs_notify_list);
> + if (free == NULL)
> return;
> - }
> - kernfs_notify_list = kn->attr.notify_next;
> - kn->attr.notify_next = NULL;
> - spin_unlock_irq(&kernfs_notify_lock);
>
> + attr = llist_entry(free, struct kernfs_elem_attr, notify_next);
> + kn = attribute_to_node(attr, struct kernfs_node, attr);
> root = kernfs_root(kn);
> /* kick fsnotify */
> down_write(&root->kernfs_rwsem);
> @@ -970,12 +966,14 @@ static void kernfs_notify_workfn(struct work_struct *work)
> void kernfs_notify(struct kernfs_node *kn)
> {
> static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
> - unsigned long flags;
> struct kernfs_open_node *on;
>
> if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
> return;
>
> + /* Because we are using llist for kernfs_notify_list */
> + WARN_ON_ONCE(in_nmi());
> +
> /* kick poll immediately */
> rcu_read_lock();
> on = rcu_dereference(kn->attr.open);
> @@ -986,14 +984,9 @@ void kernfs_notify(struct kernfs_node *kn)
> rcu_read_unlock();
>
> /* schedule work to kick fsnotify */
> - spin_lock_irqsave(&kernfs_notify_lock, flags);
> - if (!kn->attr.notify_next) {
> - kernfs_get(kn);
> - kn->attr.notify_next = kernfs_notify_list;
> - kernfs_notify_list = kn;
> - schedule_work(&kernfs_notify_work);
> - }
> - spin_unlock_irqrestore(&kernfs_notify_lock, flags);
> + kernfs_get(kn);
> + llist_add(&kn->attr.notify_next, &kernfs_notify_list);
> + schedule_work(&kernfs_notify_work);
> }
> EXPORT_SYMBOL_GPL(kernfs_notify);
>
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 13f54f078a52a..2dd9c8df0f4f6 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -116,7 +116,7 @@ struct kernfs_elem_attr {
> const struct kernfs_ops *ops;
> struct kernfs_open_node __rcu *open;
> loff_t size;
> - struct kernfs_node *notify_next; /* for kernfs_notify() */
> + struct llist_node notify_next; /* for kernfs_notify() */
> };
>
> /*

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland