Re: [GIT PULL] fsverity fixes for v6.3-rc4

From: Tejun Heo
Date: Tue Mar 21 2023 - 02:06:00 EST


Hello,

(cc'ing Lai.)

On Mon, Mar 20, 2023 at 03:31:13PM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > Nathan Huckleberry (1):
> > fsverity: Remove WQ_UNBOUND from fsverity read workqueue
>
> There's a *lot* of other WQ_UNBOUND users. If it performs that badly,
> maybe there is something wrong with the workqueue code.
>
> Should people be warned to not use WQ_UNBOUND - or is there something
> very special about fsverity?
>
> Added Tejun to the cc. With one of the main documented reasons for
> WQ_UNBOUND being performance (both implicit "try to start execution of
> work items as soon as possible") and explicit ("CPU intensive
> workloads which can be better managed by the system scheduler"), maybe
> it's time to reconsider?
>
> WQ_UNBOUND adds a fair amount of complexity and special cases to the
> workqueues, and this is now the second "let's remove it because it's
> hurting things in a big way".

Do you remember what the other case was? Was it also on a heterogeneous ARM
setup?

There aren't many differences between unbound workqueues and percpu ones
that aren't concurrency managed. If there are significant performance
differences, it's unlikely to be directly from whatever workqueue is doing.

One obvious thing that comes to mind is that WQ_UNBOUND may be pushing tasks
across expensive cache boundaries (e.g. across cores that are living on
separate L3 complexes). This isn't a totally new problem and workqueue has
some topology awareness: by default, WQ_UNBOUND pools are segregated across
NUMA boundaries. This used to be fine but I think it's likely outmoded now,
given that non-trivial cache hierarchies on top of UMA or inside a node are
a thing these days.

Looking at f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue"), I feel a bit uneasy. This would be fine on a setup which does a
moderate amount of IOs on CPUs with quick enough acceleration mechanisms, but
that's not the whole world. Use cases that generate an extreme amount of IOs
do depend on the ability to fan out IO related work items across multiple
CPUs, especially if the IOs coincide with network activities. So, my intuition is
that the commit is fixing a subset of use cases while likely regressing
others.

If the cache theory is correct, the right thing to do would be making
workqueue init code a bit smarter so that it segments unbound pools on LLC
boundaries rather than NUMA, which would make more sense on recent AMD chips
too. Nathan, can you run `hwloc-ls` on the affected setup (or `lstopo
out.pdf`) and attach the output?
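
To make the LLC idea concrete, here's a rough userspace sketch of the
grouping, not actual workqueue code. group_cpus_by_llc() and its parameters
are hypothetical names made up for illustration. It assumes a per-CPU LLC
identifier is already known (on Linux it could be derived from something like
the cache/indexN/shared_cpu_list files under /sys/devices/system/cpu/cpuM/)
and just maps CPUs that share an LLC to the same pool index:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper for illustration only, not kernel code.
 *
 * llc_id[i]  - identifier of the last-level cache that CPU i sits behind
 *              (assumed to be supplied by the caller)
 * pool_of[i] - output: index of the pool assigned to CPU i
 *
 * CPUs with the same llc_id end up in the same pool. Pools are numbered
 * in the order in which a new LLC is first seen. Returns the pool count.
 */
size_t group_cpus_by_llc(const int *llc_id, size_t nr_cpus, size_t *pool_of)
{
	size_t nr_pools = 0;

	assert(llc_id && pool_of);

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		size_t prev;

		/* Look for an earlier CPU that shares this CPU's LLC. */
		for (prev = 0; prev < cpu; prev++) {
			if (llc_id[prev] == llc_id[cpu])
				break;
		}

		if (prev < cpu)
			pool_of[cpu] = pool_of[prev];	/* join the existing pool */
		else
			pool_of[cpu] = nr_pools++;	/* first CPU on this LLC */
	}

	return nr_pools;
}
```

On a single-node machine with two L3 complexes of four CPUs each, this would
yield two pools where NUMA-based segregation yields one. The quadratic scan
is only there to keep the sketch short.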

As for the overhead of supporting WQ_UNBOUND, it does add a non-trivial amount
of complexity but of the boring kind. It's all managerial stuff which isn't
too difficult to understand and is relatively easy to debug and fix when
something goes wrong, so it isn't expensive in terms of supportability. It
also addresses classes of significant use cases, so I think we should just
fix it.

Thanks.

--
tejun