Re: [GIT PULL] fsverity fixes for v6.3-rc4

From: Tejun Heo
Date: Wed Mar 22 2023 - 21:04:35 EST


Hello, Linus.

On Tue, Mar 21, 2023 at 11:31:35AM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 11:05 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
> >
> > Do you remember what the other case was? Was it also on heterogenous arm
> > setup?
>
> Yup. See commit c25da5b7baf1 ("dm verity: stop using WQ_UNBOUND for verify_wq")
>
> But see also 3fffb589b9a6 ("erofs: add per-cpu threads for
> decompression as an option").
>
> And you can see the confusion this all has in commit 43fa47cb116d ("dm
> verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND"), which
> perhaps should be undone now.

Thanks for the pointers. They all seem like plausible symptoms of work items
getting bounced across slow cache boundaries. I'm off for a few weeks, so I
can't really dig in right now, but I'll get to it afterwards.
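
For reference, this is roughly the choice those commits are making - a
minimal sketch only, with illustrative names and error handling, not the
actual dm-verity or erofs code:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/workqueue.h>
#include <linux/cpumask.h>

static struct workqueue_struct *wq_unbound;
static struct workqueue_struct *wq_percpu;

static int __init wq_flavor_init(void)
{
	/*
	 * Unbound: work items may run on any CPU in the pool's cpumask,
	 * which today is sliced per NUMA node; the scheduler picks the CPU.
	 */
	wq_unbound = alloc_workqueue("demo_unbound",
				     WQ_UNBOUND | WQ_MEM_RECLAIM,
				     num_online_cpus());

	/*
	 * Per-cpu, not concurrency managed: each work item executes on the
	 * CPU it was queued from, keeping cache locality but unable to
	 * migrate away from a busy CPU.
	 */
	wq_percpu = alloc_workqueue("demo_percpu",
				    WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 0);

	if (!wq_unbound || !wq_percpu) {
		if (wq_unbound)
			destroy_workqueue(wq_unbound);
		if (wq_percpu)
			destroy_workqueue(wq_percpu);
		return -ENOMEM;
	}
	return 0;
}

static void __exit wq_flavor_exit(void)
{
	destroy_workqueue(wq_percpu);
	destroy_workqueue(wq_unbound);
}

module_init(wq_flavor_init);
module_exit(wq_flavor_exit);
MODULE_LICENSE("GPL");

WQ_CPU_INTENSIVE only matters for the per-cpu case: it exempts the items
from concurrency management so a long-running verify or decompress job
doesn't hold up other per-cpu work on the same CPU.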

> > There aren't many differences between unbound workqueues and percpu ones
> > that aren't concurrency managed. If there are significant performance
> > differences, it's unlikely to be directly from whatever workqueue is doing.
>
> There's a *lot* of special cases for WQ_UNBOUND in the workqueue code,
> and they are a lot less targeted than the other WQ_xyz flags, I feel.
> They have their own cpumask logic, special freeing rules etc etc.
>
> So I would say that the "aren't many differences" is not exactly true.
> There are subtle and random differences, including the very basic
> "queue_work()" workflow.

Oh yeah, the pwq management side is pretty involved, being dynamic and all. I
just couldn't think of anything in the issue & execution paths that would
explain the reported significant performance penalty. The issue path
differences come down to node selection and dynamic pwq release handling,
neither of which should be in play in this case.
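
To make the node selection difference concrete, here's a tiny self-contained
model (plain userspace C with made-up helper names, not the actual
__queue_work() code): a per-cpu wq picks the pwq of the queueing CPU, an
unbound wq picks the pwq of the queueing CPU's node and leaves the eventual
CPU choice to the scheduler:

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS		8
#define CPUS_PER_NODE	4

/* toy topology: cpus 0-3 on node 0, cpus 4-7 on node 1 */
static int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

/*
 * pick_pwq() stands in for the pwq selection on the issue path: per-cpu
 * wq's use the pwq of the queueing CPU, unbound wq's use the per-node pwq.
 */
static int pick_pwq(int queueing_cpu, bool unbound)
{
	if (unbound)
		return NR_CPUS + cpu_to_node(queueing_cpu); /* per-node pwq */
	return queueing_cpu;				    /* per-cpu pwq  */
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("queued on cpu %d -> percpu pwq %d, unbound pwq %d\n",
		       cpu, pick_pwq(cpu, false), pick_pwq(cpu, true));
	return 0;
}

Once the pwq is picked, the execution side looks much the same; which CPU an
unbound worker actually lands on within the node is entirely up to the
scheduler.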

> Now, I assume that the arm cases don't actually use
> wq_unbound_cpumask, so I assume it's mostly the "instead of local cpu
> queue, use the local node queue", and so it's all on random CPU's
> since nobody uses NUMA nodes.
>
> And no, if it's caching effects, doing it on LLC boundaries isn't
> right *either*. By default it should probably be on L2 boundaries or
> something, with most non-NUMA setups likely having one single LLC but
> multiple L2 nodes.

Hmm... on recent x86 cpus, that'd just end up pairing up the hyperthreads,
which would likely be too narrow, especially given that L3s on recent cpus
seem pretty fast. I think what I need to do is generalize the NUMA logic so
that it can sit on any of these topological boundaries, let the arch define
the default boundary, and let each wq override the selection.
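
Roughly, and purely as a hypothetical sketch (none of these identifiers
exist in the kernel today), what I have in mind would look something like:

/*
 * Hypothetical sketch only: an affinity-scope knob for unbound workqueues,
 * with an arch-chosen default and a per-wq override.  All names are made up
 * for illustration.
 */
enum wq_affn_scope {
	WQ_AFFN_CPU,	/* one pool per CPU				*/
	WQ_AFFN_SMT,	/* pools span SMT siblings			*/
	WQ_AFFN_CACHE,	/* pools span a cache domain (L2 or LLC)	*/
	WQ_AFFN_NUMA,	/* today's behavior: pools span a NUMA node	*/
	WQ_AFFN_SYSTEM,	/* one system-wide pool				*/
};

/* arch hook: e.g. a heterogeneous arm64 part might pick WQ_AFFN_CACHE */
enum wq_affn_scope arch_default_wq_affn_scope(void)
{
	return WQ_AFFN_NUMA;
}

/* per-wq override folded into the unbound attrs */
struct wq_attrs_sketch {
	int nice;			/* existing attr		*/
	enum wq_affn_scope affn_scope;	/* boundary for worker pools	*/
};

With something along those lines, a heterogeneous arm setup could default to
cache boundaries while wq users that don't care about locality could still
opt into wider scopes.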

Another related shortcoming is that while unbound wq's say "let the scheduler
figure out the best solution within the boundary", they don't communicate the
locality of a work item to the scheduler at all, so within each boundary,
from the scheduler's POV, the assignment is completely random. Down the line,
it'd probably help if wq could provide some hints about the expected
locality.
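
Purely imaginary, but the shape of such a hint could be as simple as the
following (neither the struct nor sched_wake_worker_near() exist anywhere;
they only illustrate "tell the scheduler which CPU the item was queued
from"):

struct task_struct;

struct work_locality_hint {
	int queueing_cpu;	/* CPU queue_work() ran on */
};

/* imaginary scheduler entry treating the CPU as a soft preference */
void sched_wake_worker_near(struct task_struct *worker, int preferred_cpu);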

Thanks.

--
tejun