Re: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads

From: Sandeep Dhavale
Date: Wed Feb 08 2023 - 02:00:34 EST


On Mon, Feb 6, 2023 at 6:55 PM Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2023/2/7 03:41, Sandeep Dhavale wrote:
> > On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <xiang@xxxxxxxxxx> wrote:
> >>
> >> Hi Sandeep,
> >>
> >> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
> >>> Using a per-cpu thread pool, we can reduce scheduling latency compared
> >>> to the workqueue implementation. With this patch, scheduling latency
> >>> and its variation are reduced because the per-cpu threads are
> >>> high-priority kthread_workers.
> >>>
> >>> The results were evaluated on arm64 Android devices running a 5.10
> >>> kernel.
> >>>
> >>> The table below shows the resulting improvement in total scheduling
> >>> latency for the same app launch benchmark, run for 50 iterations.
> >>> Scheduling latency is the time between when the task (workqueue kworker
> >>> vs kthread_worker) became eligible to run and when it actually started
> >>> running.
> >>> +-------------------------+-----------+----------------+---------+
> >>> |                         | workqueue | kthread_worker |  diff   |
> >>> +-------------------------+-----------+----------------+---------+
> >>> | Average (us)            |     15253 |           2914 | -80.89% |
> >>> | Median (us)             |     14001 |           2912 | -79.20% |
> >>> | Minimum (us)            |      3117 |           1027 | -67.05% |
> >>> | Maximum (us)            |     30170 |           3805 | -87.39% |
> >>> | Standard deviation (us) |      7166 |            359 |         |
> >>> +-------------------------+-----------+----------------+---------+
> >>>
> >>> Background: Boot times and cold app launch benchmarks are very
> >>> important to the Android ecosystem as they directly translate to
> >>> responsiveness from the user's point of view. While erofs provides
> >>> a lot of important features like space savings, we saw some
> >>> performance penalty in cold app launch benchmarks in a few scenarios.
> >>> Analysis showed that the significant variance was coming from the
> >>> scheduling cost, while the decompression cost was more or less the same.
> >>>
> >>> With a per-cpu thread pool, the table above shows that scheduling
> >>> latency is reduced by ~80% on average. This problem was discussed
> >>> at LPC 2022; the slides and talk are linked at [1].
> >>>
> >>> [1] https://lpc.events/event/16/contributions/1338/
> >>>
> >>> Signed-off-by: Sandeep Dhavale <dhavale@xxxxxxxxxx>
> >>> ---
> >>> V3 -> V4
> >>> * Updated commit message with background information
> >>> V2 -> V3
> >>> * Fix a warning Reported-by: kernel test robot <lkp@xxxxxxxxx>
> >>> V1 -> V2
> >>> * Changed name of kthread_workers from z_erofs to erofs_worker
> >>> * Added kernel configuration to run kthread_workers at normal or
> >>> high priority
> >>> * Added cpu hotplug support
> >>> * Added wrapped kthread_workers under worker_pool
> >>> * Added one unbound thread in a pool to handle a context where
> >>> we already stopped per-cpu kthread worker
> >>> * Updated commit message
> >>
> >> I've just modified your v4 patch based on erofs -dev branch with
> >> my previous suggestion [1], but I haven't tested it.
> >>
> >> Could you help check if the updated patch looks good to you and
> >> test it on your side? If there are unexpected behaviors, please
> >> help update as well, thanks!
> > Thanks Xiang, I was working on the same. I see that you have cleaned it up.
> > I will test it and report/fix any problems.
> >
> > Thanks,
> > Sandeep.
>
> Thanks! Look forward to your test. BTW, we have < 2 weeks for 6.3, so I'd
> like to fix it this week so that we could catch 6.3 merge window.
>
>
> I've fixed some cpu hotplug errors as below and added to a branch for 0day CI
> testing.
>
Hi Xiang,
I have tested this version of the patch with:
- Multiple device reboot tests
- Cold app launch tests
- Cold app launch tests with CPU offline/online

All tests ran successfully and no issue was observed.
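
For anyone who wants to reproduce the cpu offline/online part, CPUs can be
toggled from userspace through the per-CPU sysfs "online" files. The C helper
below is only a minimal sketch of that mechanism and is not the harness used
for the runs above; the function names and the "base" parameter (normally
/sys/devices/system/cpu) are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<base>/cpu<N>/online" into buf; returns 0, or -1 if it does not fit. */
static int cpu_online_path(char *buf, size_t len, const char *base,
			   unsigned int cpu)
{
	int n = snprintf(buf, len, "%s/cpu%u/online", base, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1" (online) or "0" (offline) to the control file.
 * Returns 0 on success and -1 on any failure. Needs root on a real system.
 */
static int cpu_set_online(const char *base, unsigned int cpu, int online)
{
	char path[256];
	FILE *f;
	int ret;

	if (cpu_online_path(path, sizeof(path), base, cpu))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ret = fputs(online ? "1" : "0", f) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

/* Small self-check of the path construction only; touches no sysfs files. */
static void cpu_online_path_selfcheck(void)
{
	char buf[64];

	assert(cpu_online_path(buf, sizeof(buf), "/sys/devices/system/cpu", 3) == 0);
	assert(strcmp(buf, "/sys/devices/system/cpu/cpu3/online") == 0);
}
```

On a device this would be called as cpu_set_online("/sys/devices/system/cpu",
1, 0) before a launch iteration and again with 1 afterwards. Some CPUs may not
support being taken offline, in which case the write fails.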

Thanks,
Sandeep.

> Thanks,
> Gao Xiang
>
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 73198f494a6a..92a9e20948b0 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -398,7 +398,7 @@ static inline void erofs_destroy_percpu_workers(void) {}
> static inline int erofs_init_percpu_workers(void) { return 0; }
> #endif
>
> -#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
> +#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_EROFS_FS_PCPU_KTHREAD)
> static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
> static enum cpuhp_state erofs_cpuhp_state;
>
> @@ -408,7 +408,7 @@ static int erofs_cpu_online(unsigned int cpu)
>
> worker = erofs_init_percpu_worker(cpu);
> if (IS_ERR(worker))
> - return ERR_PTR(worker);
> + return PTR_ERR(worker);
>
> spin_lock(&z_erofs_pcpu_worker_lock);
> old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> @@ -428,7 +428,7 @@ static int erofs_cpu_offline(unsigned int cpu)
> spin_lock(&z_erofs_pcpu_worker_lock);
> worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> lockdep_is_held(&z_erofs_pcpu_worker_lock));
> - rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
> spin_unlock(&z_erofs_pcpu_worker_lock);
>
> synchronize_rcu();
>
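
For readers less familiar with the error-pointer helpers: the ERR_PTR ->
PTR_ERR change above matters because erofs_cpu_online() returns int. PTR_ERR()
extracts the negative errno from an error pointer, whereas ERR_PTR() goes the
other way. The following is a simplified userspace sketch of that encoding.
The my_* helpers are hypothetical stand-ins modelled on the kernel helpers,
not the kernel definitions, and fake_init_worker()/fake_cpu_online() are
invented purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MY_MAX_ERRNO 4095

/* Encode a negative errno into a pointer value (stand-in for ERR_PTR()). */
static inline void *my_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno back out of an error pointer (stand-in for PTR_ERR()). */
static inline long my_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MY_MAX_ERRNO addresses (stand-in for IS_ERR()). */
static inline int my_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MY_MAX_ERRNO;
}

/* Hypothetical constructor: returns an error pointer when asked to fail. */
static inline void *fake_init_worker(int fail)
{
	static int dummy_worker;

	return fail ? my_err_ptr(-ENOMEM) : (void *)&dummy_worker;
}

/* Shape of an int-returning callback: hand back the errno, not a pointer. */
static inline int fake_cpu_online(int fail)
{
	void *worker = fake_init_worker(fail);

	if (my_is_err(worker))
		return (int)my_ptr_err(worker);
	return 0;
}

/* Exercise both the success and the failure path. */
static inline void err_ptr_selfcheck(void)
{
	assert(fake_cpu_online(0) == 0);
	assert(fake_cpu_online(1) == -ENOMEM);
}
```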
> >
> >>
> >> [1] https://lore.kernel.org/r/5e1b7191-9ea6-3781-7928-72ac4cd88591@xxxxxxxxxxxxxxxxx/
> >>
> >> Thanks,
> >> Gao Xiang
> >>
> >> From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
> >> From: Sandeep Dhavale <dhavale@xxxxxxxxxx>
> >> Date: Mon, 6 Feb 2023 17:53:39 +0800
> >> Subject: [PATCH] erofs: add per-cpu threads for decompression
> >>
> >> Using a per-cpu thread pool, we can reduce scheduling latency compared
> >> to the workqueue implementation. With this patch, scheduling latency
> >> and its variation are reduced because the per-cpu threads are
> >> high-priority kthread_workers.
> >>
> >> The results were evaluated on arm64 Android devices running a 5.10
> >> kernel.
> >>
> >> The table below shows the resulting improvement in total scheduling
> >> latency for the same app launch benchmark, run for 50 iterations.
> >> Scheduling latency is the time between when the task (workqueue kworker
> >> vs kthread_worker) became eligible to run and when it actually started
> >> running.
> >> +-------------------------+-----------+----------------+---------+
> >> |                         | workqueue | kthread_worker |  diff   |
> >> +-------------------------+-----------+----------------+---------+
> >> | Average (us)            |     15253 |           2914 | -80.89% |
> >> | Median (us)             |     14001 |           2912 | -79.20% |
> >> | Minimum (us)            |      3117 |           1027 | -67.05% |
> >> | Maximum (us)            |     30170 |           3805 | -87.39% |
> >> | Standard deviation (us) |      7166 |            359 |         |
> >> +-------------------------+-----------+----------------+---------+
> >>
> >> Background: Boot times and cold app launch benchmarks are very
> >> important to the Android ecosystem as they directly translate to
> >> responsiveness from the user's point of view. While erofs provides
> >> a lot of important features like space savings, we saw some
> >> performance penalty in cold app launch benchmarks in a few scenarios.
> >> Analysis showed that the significant variance was coming from the
> >> scheduling cost, while the decompression cost was more or less the same.
> >>
> >> With a per-cpu thread pool, the table above shows that scheduling
> >> latency is reduced by ~80% on average. This problem was discussed
> >> at LPC 2022; the slides and talk are linked at [1].
> >>
> >> [1] https://lpc.events/event/16/contributions/1338/
> >>
> >> Signed-off-by: Sandeep Dhavale <dhavale@xxxxxxxxxx>
> >> Signed-off-by: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx>
> >> ---
> >> fs/erofs/Kconfig | 18 +++++
> >> fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
> >> 2 files changed, 189 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> >> index 85490370e0ca..704fb59577e0 100644
> >> --- a/fs/erofs/Kconfig
> >> +++ b/fs/erofs/Kconfig
> >> @@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
> >> read support.
> >>
> >> If unsure, say N.
> >> +
> >> +config EROFS_FS_PCPU_KTHREAD
> >> + bool "EROFS per-cpu decompression kthread workers"
> >> + depends on EROFS_FS_ZIP
> >> + help
> >> + Saying Y here enables per-CPU kthread workers pool to carry out
> >> + async decompression for low latencies on some architectures.
> >> +
> >> + If unsure, say N.
> >> +
> >> +config EROFS_FS_PCPU_KTHREAD_HIPRI
> >> + bool "EROFS high priority per-CPU kthread workers"
> >> + depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
> >> + help
> >> + This permits EROFS to configure per-CPU kthread workers to run
> >> + at higher priority.
> >> +
> >> + If unsure, say N.
> >> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> >> index 384f64292f73..73198f494a6a 100644
> >> --- a/fs/erofs/zdata.c
> >> +++ b/fs/erofs/zdata.c
> >> @@ -7,6 +7,8 @@
> >> #include "compress.h"
> >> #include <linux/prefetch.h>
> >> #include <linux/psi.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/cpuhotplug.h>
> >>
> >> #include <trace/events/erofs.h>
> >>
> >> @@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
> >> union {
> >> struct completion done;
> >> struct work_struct work;
> >> + struct kthread_work kthread_work;
> >> } u;
> >> bool eio, sync;
> >> };
> >> @@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
> >>
> >> static struct workqueue_struct *z_erofs_workqueue __read_mostly;
> >>
> >> -void z_erofs_exit_zip_subsystem(void)
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> +static struct kthread_worker __rcu **z_erofs_pcpu_workers;
> >> +
> >> +static void erofs_destroy_percpu_workers(void)
> >> {
> >> - destroy_workqueue(z_erofs_workqueue);
> >> - z_erofs_destroy_pcluster_pool();
> >> + struct kthread_worker *worker;
> >> + unsigned int cpu;
> >> +
> >> + for_each_possible_cpu(cpu) {
> >> + worker = rcu_dereference_protected(
> >> + z_erofs_pcpu_workers[cpu], 1);
> >> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
> >> + if (worker)
> >> + kthread_destroy_worker(worker);
> >> + }
> >> + kfree(z_erofs_pcpu_workers);
> >> }
> >>
> >> -static inline int z_erofs_init_workqueue(void)
> >> +static struct kthread_worker *erofs_init_percpu_worker(int cpu)
> >> {
> >> - const unsigned int onlinecpus = num_possible_cpus();
> >> + struct kthread_worker *worker =
> >> + kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
> >>
> >> - /*
> >> - * no need to spawn too many threads, limiting threads could minimum
> >> - * scheduling overhead, perhaps per-CPU threads should be better?
> >> - */
> >> - z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
> >> - WQ_UNBOUND | WQ_HIGHPRI,
> >> - onlinecpus + onlinecpus / 4);
> >> - return z_erofs_workqueue ? 0 : -ENOMEM;
> >> + if (IS_ERR(worker))
> >> + return worker;
> >> + if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
> >> + sched_set_fifo_low(worker->task);
> >> + else
> >> + sched_set_normal(worker->task, 0);
> >> + return worker;
> >> +}
> >> +
> >> +static int erofs_init_percpu_workers(void)
> >> +{
> >> + struct kthread_worker *worker;
> >> + unsigned int cpu;
> >> +
> >> + z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
> >> + sizeof(struct kthread_worker *), GFP_ATOMIC);
> >> + if (!z_erofs_pcpu_workers)
> >> + return -ENOMEM;
> >> +
> >> + for_each_online_cpu(cpu) { /* could miss cpu{off,on}line? */
> >> + worker = erofs_init_percpu_worker(cpu);
> >> + if (!IS_ERR(worker))
> >> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
> >> + }
> >> + return 0;
> >> +}
> >> +#else
> >> +static inline void erofs_destroy_percpu_workers(void) {}
> >> +static inline int erofs_init_percpu_workers(void) { return 0; }
> >> +#endif
> >> +
> >> +#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
> >> +static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
> >> +static enum cpuhp_state erofs_cpuhp_state;
> >> +
> >> +static int erofs_cpu_online(unsigned int cpu)
> >> +{
> >> + struct kthread_worker *worker, *old;
> >> +
> >> + worker = erofs_init_percpu_worker(cpu);
> >> + if (IS_ERR(worker))
> >> + return ERR_PTR(worker);
> >> +
> >> + spin_lock(&z_erofs_pcpu_worker_lock);
> >> + old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> >> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
> >> + if (!old)
> >> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
> >> + spin_unlock(&z_erofs_pcpu_worker_lock);
> >> + if (old)
> >> + kthread_destroy_worker(worker);
> >> + return 0;
> >> +}
> >> +
> >> +static int erofs_cpu_offline(unsigned int cpu)
> >> +{
> >> + struct kthread_worker *worker;
> >> +
> >> + spin_lock(&z_erofs_pcpu_worker_lock);
> >> + worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> >> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
> >> + rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> >> + spin_unlock(&z_erofs_pcpu_worker_lock);
> >> +
> >> + synchronize_rcu();
> >> + if (worker)
> >> + kthread_destroy_worker(worker);
> >> + return 0;
> >> +}
> >> +
> >> +static int erofs_cpu_hotplug_init(void)
> >> +{
> >> + int state;
> >> +
> >> + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> >> + "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
> >> + if (state < 0)
> >> + return state;
> >> +
> >> + erofs_cpuhp_state = state;
> >> + return 0;
> >> +}
> >> +
> >> +static void erofs_cpu_hotplug_destroy(void)
> >> +{
> >> + if (erofs_cpuhp_state)
> >> + cpuhp_remove_state_nocalls(erofs_cpuhp_state);
> >> +}
> >> +#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
> >> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
> >> +static inline void erofs_cpu_hotplug_destroy(void) {}
> >> +#endif
> >> +
> >> +void z_erofs_exit_zip_subsystem(void)
> >> +{
> >> + erofs_cpu_hotplug_destroy();
> >> + erofs_destroy_percpu_workers();
> >> + destroy_workqueue(z_erofs_workqueue);
> >> + z_erofs_destroy_pcluster_pool();
> >> }
> >>
> >> int __init z_erofs_init_zip_subsystem(void)
> >> @@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
> >> int err = z_erofs_create_pcluster_pool();
> >>
> >> if (err)
> >> - return err;
> >> - err = z_erofs_init_workqueue();
> >> + goto out_error_pcluster_pool;
> >> +
> >> + z_erofs_workqueue = alloc_workqueue("erofs_worker",
> >> + WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
> >> + if (!z_erofs_workqueue)
> >> + goto out_error_workqueue_init;
> >> +
> >> + err = erofs_init_percpu_workers();
> >> if (err)
> >> - z_erofs_destroy_pcluster_pool();
> >> + goto out_error_pcpu_worker;
> >> +
> >> + err = erofs_cpu_hotplug_init();
> >> + if (err < 0)
> >> + goto out_error_cpuhp_init;
> >> + return err;
> >> +
> >> +out_error_cpuhp_init:
> >> + erofs_destroy_percpu_workers();
> >> +out_error_pcpu_worker:
> >> + destroy_workqueue(z_erofs_workqueue);
> >> +out_error_workqueue_init:
> >> + z_erofs_destroy_pcluster_pool();
> >> +out_error_pcluster_pool:
> >> return err;
> >> }
> >>
> >> @@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
> >>
> >> DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
> >> z_erofs_decompress_queue(bgq, &pagepool);
> >> -
> >> erofs_release_pages(&pagepool);
> >> kvfree(bgq);
> >> }
> >>
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
> >> +{
> >> + z_erofs_decompressqueue_work((struct work_struct *)work);
> >> +}
> >> +#endif
> >> +
> >> static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
> >> int bios)
> >> {
> >> @@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
> >>
> >> if (atomic_add_return(bios, &io->pending_bios))
> >> return;
> >> - /* Use workqueue and sync decompression for atomic contexts only */
> >> + /* Use (kthread_)work and sync decompression for atomic contexts only */
> >> if (in_atomic() || irqs_disabled()) {
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> + struct kthread_worker *worker;
> >> +
> >> + rcu_read_lock();
> >> + worker = rcu_dereference(
> >> + z_erofs_pcpu_workers[raw_smp_processor_id()]);
> >> + if (!worker) {
> >> + INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
> >> + queue_work(z_erofs_workqueue, &io->u.work);
> >> + } else {
> >> + kthread_queue_work(worker, &io->u.kthread_work);
> >> + }
> >> + rcu_read_unlock();
> >> +#else
> >> queue_work(z_erofs_workqueue, &io->u.work);
> >> +#endif
> >> /* enable sync decompression for readahead */
> >> if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
> >> sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
> >> @@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
> >> *fg = true;
> >> goto fg_out;
> >> }
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> + kthread_init_work(&q->u.kthread_work,
> >> + z_erofs_decompressqueue_kthread_work);
> >> +#else
> >> INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
> >> +#endif
> >> } else {
> >> fg_out:
> >> q = fgq;
> >> @@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
> >>
> >> /*
> >> * although background is preferred, no one is pending for submission.
> >> - * don't issue workqueue for decompression but drop it directly instead.
> >> + * don't issue decompression but drop it directly instead.
> >> */
> >> if (!*force_fg && !nr_bios) {
> >> kvfree(q[JQ_SUBMIT]);
> >> --
> >> 2.30.2
> >>
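
A note on z_erofs_decompressqueue_kthread_work() in the patch above: the cast
from struct kthread_work * to struct work_struct * only works because both are
members of the same union u and therefore share an address. One possible
alternative, shown here as a userspace sketch rather than a proposed change,
is to recover the enclosing queue with container_of-style arithmetic so that
neither handler depends on the other member's type. All type, macro and
function names below are simplified stand-ins invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; the fields are placeholders. */
struct fake_work { int data; };
struct fake_kthread_work { int flags; };

struct fake_decompressqueue {
	int head;
	union {
		struct fake_work work;
		struct fake_kthread_work kthread_work;
	} u;
};

/* Userspace version of the container_of() idea: member pointer -> parent. */
#define my_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the queue from the workqueue member. */
static inline struct fake_decompressqueue *
queue_from_work(struct fake_work *w)
{
	return my_container_of(w, struct fake_decompressqueue, u.work);
}

/* Recover the queue from the kthread_work member, with no pointer cast. */
static inline struct fake_decompressqueue *
queue_from_kthread_work(struct fake_kthread_work *kw)
{
	return my_container_of(kw, struct fake_decompressqueue, u.kthread_work);
}

/* Both paths must lead back to the same enclosing object. */
static inline void container_of_selfcheck(void)
{
	struct fake_decompressqueue q = { .head = 42 };

	assert(queue_from_work(&q.u.work) == &q);
	assert(queue_from_kthread_work(&q.u.kthread_work) == &q);
}
```

With this shape, both handlers could call one common helper that takes the
queue pointer directly. Whether that is worth the extra lines compared to the
cast is a judgment call.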