Re: [GIT PULL] workqueue fixes for v4.3-rc5

From: Tejun Heo
Date: Wed Oct 14 2015 - 15:03:08 EST


Hello, Linus.

On Wed, Oct 14, 2015 at 10:36:50AM -0700, Linus Torvalds wrote:
> > For both delayed and !delayed work items on per-cpu workqueues,
> > queueing without specifying a specific CPU always meant queueing on
> > the local CPU.
>
> That's just not true, as far as I can tell.
>
> I went back in the history to 2012, and it does
>
> 	if (unlikely(cpu != WORK_CPU_UNBOUND))
> 		add_timer_on(timer, cpu);
> 	else
> 		add_timer(timer);
>
> so this whole WORK_CPU_UNBOUND means "add_timer()" without specifying
> a CPU has been true for at least the last several years.
>
> So the documentation, the code, and the name all agree:

But wasn't add_timer() always CPU-local at the time? add_timer()
allowing cross-cpu migrations came way after that.

commit 597d0275736dad9c3bda6f0a00a1c477dc0f37b1
Author: Arun R Bharadwaj <arun@xxxxxxxxxxxxxxxxxx>
Date:   Thu Apr 16 12:13:26 2009 +0530

    timers: Framework for identifying pinned timers
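
Since that series, add_timer() only expresses a preference for the
local CPU; with NOHZ, an idle CPU can push the timer elsewhere.
add_timer_on() is the interface that actually pins. As a minimal
caller-side sketch (the two wrapper functions are hypothetical; the
timer API calls are real):

	#include <linux/jiffies.h>
	#include <linux/timer.h>

	static void arm_unpinned(struct timer_list *t)
	{
		t->expires = jiffies + HZ;
		add_timer(t);		/* no CPU guarantee: a NOHZ-idle
					 * CPU may migrate this timer */
	}

	static void arm_pinned(struct timer_list *t, int cpu)
	{
		t->expires = jiffies + HZ;
		add_timer_on(t, cpu);	/* guaranteed to fire on @cpu */
	}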

> WORK_CPU_UNBOUND does *not* mean that it's guaranteed to be the local
> CPU. The documentation says "preferred", the code clearly doesn't
> specify the CPU, and the name says "not bound to a particular CPU".

The name and documentation were mostly me being too hopeful and then
being lazy after realizing it wouldn't work.

> > I wish this were an ad-hoc thing but this has been the guaranteed
> > behavior all along. Switching the specific call-site to
> > queue_work_on() would still be a good idea tho.
>
> I really don't see how you say that it has been guaranteed behavior all along.

The behavior has been like that from the beginning. Workqueues and
timers were both always CPU-affine, and we never properly distinguished
CPU affinity required for correctness from CPU affinity used as an
optimization.
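
Today the two cases look identical at the call site. A hypothetical
pair of examples (the work items and handlers are made up; the
queueing functions are the real ones):

	#include <linux/workqueue.h>

	static void cleanup_fn(struct work_struct *w) { /* ... */ }
	static void counter_fn(struct work_struct *w) { /* ... */ }

	static DECLARE_WORK(cleanup_work, cleanup_fn);
	static DECLARE_WORK(counter_work, counter_fn);

	static void example_queueing(int cpu)
	{
		/* Affinity purely as an optimization: any CPU is
		 * correct, the local one is merely cheaper. */
		queue_work(system_wq, &cleanup_work);

		/* Affinity for correctness: counter_fn() touches
		 * CPU-local state, so it must run on @cpu. */
		queue_work_on(cpu, system_wq, &counter_work);
	}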

> It clearly has not at all been guaranteed behavior. The fact that you
> had to change the code to do that should have made it clear.

It wasn't like that before the timer change.

> The code has *always* done that non-cpu-specific "add_timer()", as far
> as I can tell. Even back when that non-bound CPU was indicated by a
> negative CPU number, and the code did
>
> 	if (unlikely(cpu >= 0))
> 		add_timer_on(timer, cpu);
> 	else
> 		add_timer(timer);
>
> (that's from 2007, btw).

At the time, it was just an alternative way of writing:

	if (cpu < 0)
		cpu = raw_smp_processor_id();
	add_timer_on(timer, cpu);

> So I really don't see your "guaranteed behavior" argument. It seems to
> be downright pure bullshit. The lack of a specific CPU has _always_
> (where "always" means "at least since 2007") meant "non-specific cpu",
> rather than "local cpu".
>
> If some particular interface ended up then actually using "local cpu"
> instead, that was neither guaranteed nor implied - it was just a
> random implementation detail, and shouldn't be something we guarantee
> at all.
>
> We strive to maintain user-space ABI issues even in the face of
> unintentional bugs and misfeatures. But we do *not* keep broken random
> in-kernel interfaces alive. We fix breakage and clean up code rather
> than say "some random in-kernel user expects broken behavior".
>
> And it seems very clear that WORK_CPU_UNBOUND does *not* mean "local
> cpu", and code that depends on it meaning local cpu is broken.
>
> Now, there are reasons why the *implementation* might want to choose
> the local cpu for things - avoiding waking up other cpu's
> unnecessarily with cross-cpu calls etc - but at the same time I think
> it's quite clear that mm/vmstat.c is simply broken in using a
> non-bound interface and then expecting/assuming a particular CPU.

I don't think I'm confused about the timer behavior. At any rate, for
!delayed work items, the failure to distinguish affinity for
correctness from affinity for optimization has been there for quite a
while. I'd be happy to ditch that, but I don't think it'd be a good
idea to do so at -rc5 without auditing the call sites or adding debug
mechanisms to detect errors.
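
For instance, one conceivable debug mechanism (purely hypothetical,
nothing like it in the tree) would record the intended CPU at queueing
time and warn if the handler runs elsewhere:

	#include <linux/kernel.h>
	#include <linux/smp.h>
	#include <linux/workqueue.h>

	/* Hypothetical wrapper for work items that require a
	 * specific CPU for correctness. */
	struct pcpu_work {
		struct work_struct	work;
		int			expected_cpu;
	};

	static void pcpu_work_fn(struct work_struct *work)
	{
		struct pcpu_work *pw = container_of(work, struct pcpu_work,
						    work);

		WARN_ONCE(raw_smp_processor_id() != pw->expected_cpu,
			  "work ran on CPU%d, expected CPU%d\n",
			  raw_smp_processor_id(), pw->expected_cpu);

		/* ... the real per-cpu handler body goes here ... */
	}

Auditing with something like that would flag every caller which queues
through a non-bound interface but then assumes a specific CPU.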

> >> (That said, it's not obvious to me why we don't just specify the cpu
> >> in the work structure itself, and just get rid of the "use different
> >> functions to schedule the work" model. I think it's somewhat fragile
> >> how you can end up using the same work in different "CPU boundedness"
> >> models like this).
> >
> > Hmmm... you mean associating a delayed work item with the target
> > pool_workqueue on queueing and sharing the queueing paths for both
> > delayed and !delayed work items?
>
> I wouldn't necessarily even go that far. That's more of an
> implementation detail.
>
> I just wonder if we perhaps should add the CPU boundedness to the init
> stage, and hide it away in the work structure. So if you just want to
> do work (delayed or not), you'd continue to do
>
> 	INIT_DELAYED_WORK(&work, ...);
>
> 	schedule_delayed_work(&work, delay);
>
> but if you want a percpu thing, you'd do
>
> 	INIT_DELAYED_WORK_CPU(&work, cpu, ...);
>
> 	schedule_delayed_work(&work, delay);
>
> rather than have that "..work_on(cpu, &work, )" interface. The actual
> *implementation* could stay the same, we'd just hide the CPU bound in
> the work struct and use it as an implicit argument.
>
> But it's not a big deal. I'm just saying that the current interface
> obviously allows that confusion about whether a work is percpu or not,
> on a per-call-site basis.

I was thinking more along the lines of introducing
queue_work_on_local() and friends, but yeah, we definitely need to
explicitly distinguish affinity for correctness from affinity for
performance.
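
As a rough sketch of that direction (hypothetical; built on the
existing queue_work_on()):

	#include <linux/smp.h>
	#include <linux/workqueue.h>

	/*
	 * Hypothetical: queue on the local CPU *as a correctness
	 * requirement*, as opposed to queue_work() which only
	 * prefers the local CPU.
	 */
	static inline bool queue_work_on_local(struct workqueue_struct *wq,
					       struct work_struct *work)
	{
		int cpu = get_cpu();	/* stabilize the local CPU */
		bool ret = queue_work_on(cpu, wq, work);

		put_cpu();
		return ret;
	}

That would keep queue_work()'s "local is just a preference" semantics
while giving callers like mm/vmstat.c an explicit way to ask for the
guarantee.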

Thanks.

--
tejun