Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

From: Mike Galbraith
Date: Fri Feb 22 2013 - 00:03:20 EST


On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
> > On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
> >
> >> But does this patch set really cause a regression on your Q6600? It
> >> may sacrifice something, but I still think it will benefit far more,
> >> especially on huge systems.
> >
> > We spread on FORK/EXEC, and with the new logic preferring to leave
> > the wakee remote, we will no longer pull communicating tasks back to
> > a shared cache, so while no, I haven't tested it (will try to find a
> > round tuit), it seems it _must_ hurt. Dragging data from one llc to
> > the other on a Q6600 hurts a LOT. Every time a client and server are
> > cross-llc, it's a huge hit. The previous logic pulled communicating
> > tasks together right when it matters most: intermittent load... or
> > interactive use.
>
> I agree that this is a problem that needs to be solved, but I don't
> agree that wake_affine() is the solution.

It's not perfect, but it's better than no countering force at all. It's
a relic of the dark ages, when affine meant L2, i.e. this cpu. Nowadays,
affine has a whole new meaning, L3, so it could be done differently, but
_some_ kind of opposing force is required.

> According to my understanding, in the old world, wake_affine() is only
> used if curr_cpu and prev_cpu share a cache, which means they are in
> one package; whether we search the llc sd of curr_cpu or of prev_cpu,
> we won't have a chance to spread the task out of that package.

? affine_sd is the first domain spanning both cpus, and that may be
NODE. True, we won't ever spread in the wakeup path... unless
SD_BALANCE_WAKE is set, that is. It would be nice to be able to do that
without shredding performance.
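
To illustrate the affine_sd pick, here is a minimal user-space sketch
(not kernel code; struct sd, sd_spans() and find_affine_sd() are
invented for the example): walking up from the waking cpu, affine_sd is
simply the first domain whose span also covers prev_cpu, and on a
multi-package box that can be NODE rather than a shared-cache level:

/*
 * Toy model of the affine_sd selection. The domain hierarchy is a
 * parent-linked list per cpu; spans are plain bitmasks here.
 */
#include <stdbool.h>
#include <stdio.h>

struct sd {
        const char *name;       /* e.g. "MC", "NODE" */
        unsigned long span;     /* cpus covered by this domain */
        struct sd *parent;      /* next wider domain, NULL at top */
};

static bool sd_spans(struct sd *sd, int cpu)
{
        return sd->span & (1UL << cpu);
}

/* First domain, walking upward, that also spans prev_cpu. */
static struct sd *find_affine_sd(struct sd *sd, int prev_cpu)
{
        for (; sd; sd = sd->parent)
                if (sd_spans(sd, prev_cpu))
                        return sd;
        return NULL;
}

int main(void)
{
        /* Two packages of two cpus each: {0,1} and {2,3} share L3. */
        struct sd node = { "NODE", 0xf, NULL };
        struct sd mc0  = { "MC",   0x3, &node };

        /* Waker on cpu 0, wakee last ran on cpu 3: affine_sd is NODE. */
        struct sd *affine_sd = find_affine_sd(&mc0, 3);
        printf("affine_sd = %s\n", affine_sd ? affine_sd->name : "none");
        return 0;
}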

Off the top of my pointy head, I can think of a way to _maybe_ improve
the "affine" wakeup criteria: add a small (package size? and very fast)
FIFO queue to the task struct, recording the waker/wakee relationship.
If the relationship exists in that queue (or an rbtree), try to wake
local; if not, wake remote. The thought is to identify situations a la
1:N pgbench, where you really need to keep the load spread. That need
arises when the sum of wakees + waker won't fit in one cache. True
buddies would always hit (hm, hit rate), so they always try to become
affine, where they thrive. 1:N stuff starts missing when the client
count exceeds the package size, and starts expanding its horizons.
'Course you would still need to NAK if imbalanced too badly, and let
the NUMA stuff NAK touching lard-balls and whatnot. With a little more
smarts, we could have happy 1:N, and buddies wouldn't have to chat
through 2m thick walls, so 1:N scales as well as it can before it dies
of stupidity.
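
Purely hypothetical, to make that concrete (nothing below exists in the
kernel; struct task, BUDDY_FIFO_SIZE and want_affine() are invented for
the sketch): each task keeps a tiny FIFO of recent wakeup partner pids,
a hit suggests true buddies and an affine wakeup, a miss leaves the
wakee remote, and a 1:N server's FIFO keeps getting churned by its N
clients, so that load naturally stays spread:

/*
 * Hypothetical waker/wakee buddy FIFO, user-space toy.
 */
#include <stdbool.h>
#include <stdio.h>

#define BUDDY_FIFO_SIZE 4               /* ~package size, assumed */

struct task {
        int pid;                        /* 0 == empty slot */
        int buddies[BUDDY_FIFO_SIZE];   /* recent partner pids */
        int head;                       /* next slot to overwrite */
};

static bool buddy_hit(struct task *t, int pid)
{
        for (int i = 0; i < BUDDY_FIFO_SIZE; i++)
                if (t->buddies[i] == pid)
                        return true;
        return false;
}

static void buddy_record(struct task *t, int pid)
{
        t->buddies[t->head] = pid;
        t->head = (t->head + 1) % BUDDY_FIFO_SIZE;
}

/* true: try to wake local (shared llc); false: leave wakee remote */
static bool want_affine(struct task *waker, struct task *wakee)
{
        bool hit = buddy_hit(waker, wakee->pid) ||
                   buddy_hit(wakee, waker->pid);

        buddy_record(waker, wakee->pid);
        buddy_record(wakee, waker->pid);
        return hit;
}

int main(void)
{
        struct task server = { .pid = 1 };
        struct task clients[8] = { 0 };

        /* 1:N: eight clients churn the 4-slot FIFO, so wakeups keep
         * missing and the load stays spread. */
        for (int i = 0; i < 8; i++) {
                clients[i].pid = 100 + i;
                printf("client %d: %s\n", clients[i].pid,
                       want_affine(&server, &clients[i]) ?
                       "affine" : "remote");
        }

        /* A true 1:1 buddy starts hitting on the second wakeup. */
        struct task buddy = { .pid = 50 };
        printf("buddy, 1st wake: %s\n",
               want_affine(&server, &buddy) ? "affine" : "remote");
        printf("buddy, 2nd wake: %s\n",
               want_affine(&server, &buddy) ? "affine" : "remote");
        return 0;
}

The imbalance and NUMA NAKs would still sit on top of this; the FIFO
only supplies the buddy hint.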

-Mike
