Re: [PATCH] sched: Revert 14a40ffccd61 ("sched: replacePF_THREAD_BOUND with PF_NO_SETAFFINITY")

From: Tejun Heo
Date: Mon Dec 02 2013 - 11:56:07 EST

Next message: Nicolas Ferre: "[GIT PULL] at91: cleanup for 3.14 #1 (aka move to CCF)"
Previous message: Eric Dumazet: "Re: [PATCH 1/4] pch_gbe: Fix transmit queue management"
In reply to: Ingo Molnar: "Re: [PATCH] sched: Revert 14a40ffccd61 ("sched: replacePF_THREAD_BOUND with PF_NO_SETAFFINITY")"
Next in thread: Ingo Molnar: "Re: [PATCH] sched: Revert 14a40ffccd61 ("sched: replacePF_THREAD_BOUND with PF_NO_SETAFFINITY")"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello, guys.

On Mon, Dec 02, 2013 at 04:37:46PM +0100, Ingo Molnar wrote:
> > - We should strive to never consciously place artificial limitations on
> > kernel functionality; our main reason to place limitations should be
> > correctness.

It is a correctness issue tho. We'd be promising, say, NUMA affinity
and then possibly give out workers which aren't affine to the NUMA
node. Workqueue provides an API to set random affinity and its users
can rightfully expect the configured affinity and code accordingly.

> > There are cases where limiting affinity is justified: for example the
> > case of single cpu workqueue threads, which are special for their
> > limited concurrency, esp. when coupled with per-cpu resources --
> > allowing such threads to run on other cpus is a correctness violation
> > and can crash the kernel.

So can any requested affinity if the wq user configured it. I think
you're too focused on the binary bound / unbound distinction. Things
are a lot more generic now.

> > - But using it outside this case is overly broad; it dis-allows usage
> > that is functionally fine and in some cases desired.
> >
> > In particular; tj argues ( http://lkml.kernel.org/r/20131128143848.GD3925@xxxxxxxxxxxxxx )
> >
> > "That's just inviting people to do weirdest things and then
> > reporting things like "crypt jobs on some of our 500 machines end up
> > stuck on a single cpu once in a while" which will eventually be
> > tracked down to some weird shell script setting affinity on workers
> > doing something else."
> >
> > While that is exactly what HPC/RT people _want_ in order to avoid
> > disturbing the other CPUs with said crypt work.

Hmmm... I think you're missing the point there. The point is that you
can't really do that reliably by meddling with kworkers from userland
directly. It's gonna be fragile and dangerous. Userland would have
to continously poll for new kworkers while kernel workqueue users
would get their affinity expectations broken. I don't think that's
what HPC/RT people want. If HPC/RT people want to limit the cpus that
unbound workqueues without specific affinity configured can run on,
let's please implement that properly.

> > - Furthermore, the above example is also wrong in that you should not
> > protect root from itself; there's plenty root can do to shoot his
> > own foot off, let alone shoot his head off.
> >
> > Setting affinities is limited to root, and if root messes up the
> > system he can keep the pieces. But limiting in an overly broad
> > fashion stifles the functionality of the system.

This is the same problem with bound workers. Userland could be
actively breaking configured affinities of kworkers which may be
depended upon by its users and there's no way for userland to tell
which kworker is gonna serve which workqueue - there's no fixed
relationship between them, so it's not like userland can, say, modify
affinities on crypto kworkers in isolation. Both userland and kernel
wouldn't have much idea of the impact of such fiddling.

> > - Lastly; the flag actually blocks entry into cgroups as well asn
> > sched_setaffinity(), so the name is misleading at best.

Yeah, the name could be better.

> > The right fix is to only set PF_THREAD_BOUND on per-cpu workers:
> >
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> >
> > set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> >
> > - /* prevent userland from meddling with cpumask of workqueue workers */
> > - worker->task->flags |= PF_NO_SETAFFINITY;
> > -
> > /*
> > * The caller is responsible for ensuring %POOL_DISASSOCIATED
> > * remains stable across this function. See the comments above the
> > * flag definition for details.
> > */
> > - if (pool->flags & POOL_DISASSOCIATED)
> > + if (pool->flags & POOL_DISASSOCIATED) {
> > worker->flags |= WORKER_UNBOUND;
> > + } else {
> > + /*
> > + * Prevent userland from meddling with cpumask of workqueue
> > + * workers:
> > + */
> > + worker->task->flags |= PF_THREAD_BOUND;
> > + }
> >
> > Therefore revert the patch and add an annoying but non-destructive warning
> > check against abuse.
>
> Hm, I missed these problems with the approach, but I think you are
> right.
>
> Tejun, I suspect you concur with Peter's analysis, can I also add

So, not really.

> Peter's workqueue.c fixlet above to workqueue.c to this patch plus
> your Acked-by, so that the two changes are together?

The above would trigger the added warning if a new kworker is created
for a per-cpu workqueue while the cpu is down, which may happen, so
the patch itself is somewhat broken.

I don't think this is the right path to accomodate HPC/RT use cases.
Let's please implement something proper if keeping unbound kworkers
off certain cpus is necessary.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Nicolas Ferre: "[GIT PULL] at91: cleanup for 3.14 #1 (aka move to CCF)"
Previous message: Eric Dumazet: "Re: [PATCH 1/4] pch_gbe: Fix transmit queue management"
In reply to: Ingo Molnar: "Re: [PATCH] sched: Revert 14a40ffccd61 ("sched: replacePF_THREAD_BOUND with PF_NO_SETAFFINITY")"
Next in thread: Ingo Molnar: "Re: [PATCH] sched: Revert 14a40ffccd61 ("sched: replacePF_THREAD_BOUND with PF_NO_SETAFFINITY")"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]