Re: [patch 4/4] sched: Distangle worker accounting from rq->lock

From: Tejun Heo
Date: Sun Jun 26 2011 - 06:20:13 EST


Hello, Thomas.

On Fri, Jun 24, 2011 at 11:01:12AM +0200, Thomas Gleixner wrote:
> The whole design is completely plugged into the scheduler, as you
> _abuse_ rq->lock to serialize the accounting and allow the unlocked
> access to the idle_list. It does not matter at all whether that code
> path is fast or not, it simply does not belong there.

It didn't happen like that. Hooking into the scheduler was one of
the biggest issues, and there were several different attempts at
using / generalizing existing hooks somehow, which took a good six
months without much agreement from the scheduler side.

I don't recall all the details now, but IIRC it tried to use
fire_sched_out_preempt_notifiers(), which was called inside the rq
lock, and that introduced all the convolution with the rq lock.
Later somebody suggested just open coding it and being done with it,
and that's how it ended up inside the rq lock. Hindsight is 20/20
and we could have pulled it out of the rq lock at that time, but by
then the rest of the code had already been running with various
different hooking schemes for quite some time.

Anyways, so, let's clean it up.

> The only reason why you want this precise accounting thing is a corner
> case of system overload, but not the common case. In case of system
> overload you can do extra work to figure out how many of these beasts
> are on the fly and whether it makes sense to start some more, simply
> because it does not matter in a overload situation whether you run
> through a list to gather information.

Sure, there are multiple ways to resolve the situation. I was mostly
worried about waking up / attempting to start more workers under CPU
pressure (this is fine under memory pressure, as the CPU is idle and
memory allocation failure / blocking would quickly recall the
rescuers). I've been looking at the code and am still not sure, but
it could be that this isn't really something to worry about.
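
For reference, the kind of extra work in the overload case suggested
above could look roughly like the sketch below. The struct and field
names (pool, idle_list, entry, nr_running) are made up for
illustration and are not the actual workqueue data structures; the
point is just that the information is gathered by walking a list
under the pool lock instead of being maintained in the scheduler hot
path.

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* hypothetical per-cpu worker pool, not the real gcwq layout */
struct pool {
        spinlock_t              lock;
        struct list_head        idle_list;   /* idle workers */
        atomic_t                nr_running;  /* running worker accounting */
};

struct worker {
        struct list_head        entry;       /* link on pool->idle_list */
        struct task_struct      *task;
};

/* only called when we suspect overload, so walking the list is fine */
static unsigned int count_idle_workers(struct pool *pool)
{
        struct worker *w;
        unsigned int n = 0;
        unsigned long flags;

        spin_lock_irqsave(&pool->lock, flags);
        list_for_each_entry(w, &pool->idle_list, entry)
                n++;
        spin_unlock_irqrestore(&pool->lock, flags);

        return n;
}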

The workqueue code always wakes up the first idle worker, and a
worker leaves the idle state not via its waker but of its own accord
when it starts running. This means that, with the suggested change,
there can be multiple spurious wake-ups, but they'll all try to wake
up the same idle worker which hasn't started running yet. We might
still unnecessarily try to summon rescuers, though. It needs more
consideration and experimentation.
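
To make that concrete, here is a sketch of the wake-up pattern,
reusing the made-up pool/worker structs from the sketch above (again,
this is not the actual workqueue code): the waker only looks at the
head of idle_list and wakes it, while the worker removes itself from
the list once it actually starts running, so spurious wake-ups before
that point all converge on the same task.

/* waker side: may be called multiple times, possibly spuriously */
static void wake_one_worker(struct pool *pool)
{
        struct worker *w;
        unsigned long flags;

        spin_lock_irqsave(&pool->lock, flags);
        if (!list_empty(&pool->idle_list)) {
                w = list_first_entry(&pool->idle_list, struct worker,
                                     entry);
                /* idle_list is not modified here */
                wake_up_process(w->task);
        }
        spin_unlock_irqrestore(&pool->lock, flags);
}

/* worker side: leaves the idle state of its own accord when it runs */
static void worker_enter_running(struct pool *pool, struct worker *w)
{
        unsigned long flags;

        spin_lock_irqsave(&pool->lock, flags);
        /* further wake-ups now pick the next idle worker */
        list_del_init(&w->entry);
        spin_unlock_irqrestore(&pool->lock, flags);
}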

Another concern is the additional spin_lock in the sleeping path to
synchronize against the CPU unplug operation, which I mentioned in
the other reply. Another way to do it would be to use
preempt_disable/enable() instead and call synchronize_sched() from
the trustee. That would only add an extra preemption flip compared
to the current code.
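
A rough sketch of that alternative, with made-up helper names (these
are illustration only, not a proposal for the final interface): the
sleeping path only disables preemption around the accounting, and the
trustee calls synchronize_sched() before taking over the per-cpu
state, so any accounting section which started before the unplug is
guaranteed to have finished.

#include <linux/preempt.h>
#include <linux/rcupdate.h>     /* synchronize_sched() */

/* sleeping path: called when a worker is about to sleep */
static void worker_sleeping_account(struct pool *pool)
{
        preempt_disable();
        /*
         * The per-cpu pool cannot be torn down while preemption is
         * disabled because the trustee waits in synchronize_sched()
         * below.  Do the nr_running style accounting here.
         */
        atomic_dec(&pool->nr_running);
        preempt_enable();
}

/* trustee side, during CPU unplug */
static void trustee_quiesce_sleepers(void)
{
        /*
         * Wait for all preempt-disabled sections which are currently
         * running; after this, no sleeping-path accounting can still
         * be looking at the per-cpu data we are about to take over.
         */
        synchronize_sched();
}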

> Can we please stop adding stuff to the most crucial code pathes in the
> kernel just to avoid thinking a bit harder about problems?

Heh... sure, let's just keep it technical, okay?

Thanks.

--
tejun