[PATCH v1] sched: steer waking task to empty cfs_rq for better latencies

From: Srivatsa Vaddagiri
Date: Tue Apr 24 2012 - 12:56:29 EST


During my investigation of a performance issue, I found that
we can do a better job of reducing latencies for a waking task by
steering it towards a cpu where it will get better sleeper credits.

Consider a system with two nodes, N0 and N1, each with 4 cpus, and a
cgroup /a which is of highest priority. Further, all 4 cpus in a node are
in the same llc (MC) domain.

                       N0        N1
                    (0,1,2,3) (4,5,6,7)

rq.nr_run        ->   2 1 1 1   2 2 1 1
/a cfs_rq.nr_run ->   0 0 0 0   0 0 0 1

Consider a task in "/a" waking up after a short (< sysctl_sched_latency)
sleep. Its prev_cpu was 7. select_idle_sibling(), failing to find an idle
core, simply wakes up the task on CPU7, where it may be unable to preempt
the currently running task (as its new vruntime is not sufficiently behind
the currently running task's vruntime, owing to the short sleep it
incurred). As a result, the woken task is unable to run immediately
and thus incurs some latency.

A better choice would be to find a cpu in CPU7's MC domain where the
task's cgroup has 0 tasks (thus allowing the waking task to get better
sleeper credits).
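
For reference, the sleeper credits above come from place_entity(). A
simplified sketch of its wakeup path (modeled on the fair.c of this era,
details elided; illustrative, not the exact upstream code):

/*
 * Sketch of place_entity()'s wakeup placement (simplified).
 */
static void place_entity_sketch(struct cfs_rq *cfs_rq,
				struct sched_entity *se)
{
	u64 vruntime = cfs_rq->min_vruntime;
	unsigned long thresh = sysctl_sched_latency;

	/* Halve the credit for a gentler sleeper effect. */
	if (sched_feat(GENTLE_FAIR_SLEEPERS))
		thresh >>= 1;

	/* Sleeper credit: place the waking entity up to thresh behind. */
	vruntime -= thresh;

	/*
	 * Never gain time by being placed backwards: after a short
	 * sleep, se->vruntime is still ahead of (min_vruntime - thresh),
	 * so the credit is lost and wakeup preemption can fail.
	 */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}

With a short sleep, the max_vruntime() line is what eats the credit on
prev_cpu.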

The patch below implements this idea. Results from various benchmarks
are enclosed.

Machine : 2 quad-core Intel X5570 CPUs w/ H/T enabled (16 cpus)
Kernel : tip (HEAD at 2adb096)
guest VM : 2.6.18-based enterprise Linux guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. The benchmark is run on bare metal in the root cgroup.
2. VM -> The benchmark is run inside a guest VM. Several cpu hogs (in
various cgroups) are run on the host. The cgroup setup is as below:

/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
/libvirt/qemu/hoga[bcd] (cpu.shares = 1024. hosts 4 cpu hogs each)

Mean and std. dev. (in brackets) for both the tip and tip+patch cases are
provided below:

BM scenario:

                    tip              tip+patch        Remarks
                mean (std. dev)    mean (std. dev)

volano            1    (6.5%)      0.97  (4.7%)       3% loss
sysbench [n1]     1    (0.6%)      1.004 (0.7%)       0.4% win
tbench 1 [n2]     1    (2%)        1.024 (1.6%)       2.4% win
pipe bench [n3]   1    (5.5%)      1.009 (2.5%)       0.9% win

VM scenario:

                    tip              tip+patch        Remarks
                mean (std. dev)    mean (std. dev)

sysbench [n4]     1    (1.2%)      2.21  (1.3%)       121% win
httperf [n5]      1    (5.7%)      1.522 (6%)         52.2% win
tbench 8 [n6]     1    (3.1%)      1.91  (6.4%)       91% win
volano            1    (4.3%)      1.06  (2.8%)       6% win
Trade             1                1.94               94% win


Notes:

n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client.
n3. The ops/sec metric from pipe bench was captured. pipe bench was run as:
	perf stat --repeat 10 --null perf bench sched pipe
n4. sysbench was run (inside the VM) with 8 threads.
n5. httperf was run with a burst length of 100 and wsess of 100,500,0.
The webserver was running inside the VM while the benchmark was run on a
physically different host.
n6. tbench was run over the network with 8 clients.

This is an improved version of the previously published patch, which
minimizes/avoids the regressions seen earlier:

https://lkml.org/lkml/2012/3/22/220

Comments/flames welcome!


--

Steer a waking task towards a cpu where its cgroup has zero tasks (in
order to provide it better sleeper credits and hence reduce its wakeup
latency).

Signed-off-by: Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx>

---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2459,6 +2459,32 @@ static long effective_load(struct task_g

return wl;
}
+
+/*
+ * Look for a CPU within @target's MC domain where the task's cgroup has
+ * zero tasks in its cfs_rq.
+ */
+static __always_inline int
+select_idle_cfs_rq(struct task_struct *p, int target)
+{
+ struct cpumask tmpmask;
+ struct task_group *tg = task_group(p);
+ struct sched_domain *sd;
+ int i;
+
+ if (tg == &root_task_group)
+ return target;
+
+ sd = rcu_dereference(per_cpu(sd_llc, target));
+ cpumask_and(&tmpmask, sched_domain_span(sd), tsk_cpus_allowed(p));
+ for_each_cpu(i, &tmpmask) {
+ if (!tg->cfs_rq[i]->nr_running)
+ return i;
+ }
+
+ return target;
+}
+
#else

static inline unsigned long effective_load(struct task_group *tg, int cpu,
@@ -2467,6 +2493,12 @@ static inline unsigned long effective_lo
return wl;
}

+static __always_inline int
+select_idle_cfs_rq(struct task_struct *p, int target)
+{
+ return target;
+}
+
#endif

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -2677,6 +2709,13 @@ next:
sg = sg->next;
} while (sg != sd->groups);
}
+
+ /*
+ * Look for the next best possibility - a cpu where this task gets
+ * (better) sleeper credits.
+ */
+ target = select_idle_cfs_rq(p, target);
+
done:
return target;
}
