Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

From: Rik van Riel
Date: Wed Jan 12 2011 - 22:03:20 EST


On 01/07/2011 12:29 AM, Mike Galbraith wrote:

+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in its cfs_rq if another class
+	 * task or another task group is currently executing on its cpu.
+	 * In this case, we still pull, to accelerate it toward the cpu.
+	 */
+	if (cfs_rq != p_cfs_rq && preempt && cfs_rq->nr_running == 1 &&
+	    cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), this_cpu);
+		p_cfs_rq = cfs_rq_of(pse);
+	}
+#endif

This causes some fun issues in a simple test case on
my system. The test consists of 2 4-VCPU KVM guests,
bound to the same 4 CPUs on the host.

One guest is running the AMQP performance test; the
other guest is completely idle. That means that besides
the 4 very busy VCPUs, there is only a few percent of
CPU use from background tasks in the idle guest and
the qemu-kvm userspace bits.

However, the busy guest ends up using just 3 out of
the 4 CPUs, leaving one idle!

A simple explanation for this is that the above
pulling code will pull another VCPU onto the local
CPU whenever we run into contention inside the guest
and some random background task runs on the CPU where
that other VCPU was.

From that point on, the 4 VCPUs will stay on 3
CPUs, leaving one idle. Any time we have contention
inside the guest (pretty frequent), we move whoever
is not currently running to another CPU.

Cgroups only make matters worse - libvirt places
each KVM guest into its own cgroup, so a VCPU will
almost always be alone on its own per-cgroup, per-cpu
runqueue! That can lead to pulling a VCPU onto our local
CPU because we think we are alone, when in reality we
share the CPU with others...

Removing the pulling code allows a 4-VCPU KVM guest
to use all 4 CPUs in an uncontended situation.

+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;

Using set_next_buddy propagates this up to the root,
allowing the scheduler to actually know who we want to
run next when cgroups is involved.
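
For reference, the reason set_next_buddy helps: it walks
from the task's sched_entity up through its task group
parents and sets ->next at every level, so the hint is
still visible where the hierarchical pick happens. Roughly
(a sketch along the lines of sched_fair.c's set_next_buddy,
minus its extra guards; details differ by kernel version):

static void set_next_buddy(struct sched_entity *se)
{
	/* With FAIR_GROUP_SCHED this walks se and all its group parents. */
	for_each_sched_entity(se)
		cfs_rq_of(se)->next = se;
}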

+	/* We know whether we want to preempt or not, but are we allowed? */
+	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
+		resched_task(task_of(p_cfs_rq->curr));

With this in place, we can get into the situation where
we will gladly give up CPU time, but not actually give
any to the other VCPUs in our guest.

I believe we can get rid of that test, because pick_next_entity
already makes sure it ignores ->next if picking ->next would
lead to unfairness.
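
The relevant part of pick_next_entity: the ->next buddy
only wins if running it would not be too unfair to the
leftmost (most deserving) entity. Roughly (paraphrasing
sched_fair.c; the exact form varies by version):

	struct sched_entity *se = __pick_next_entity(cfs_rq);
	struct sched_entity *left = se;

	/* Honour the ->next buddy only if that is not too unfair. */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;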

Removing this test (and simplifying yield_to_task_fair) seems
to lead to more predictable test results.

I'll send the updated patch in another email, since this one is
already way too long for a changelog :)

--
All rights reversed