Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

From: Andrew Theurer
Date: Thu Sep 13 2012 - 17:31:04 EST


On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> * Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx> [2012-09-11 13:27:41]:
>
> > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > >>>> +{
> > > >>>> + if (!curr->sched_class->yield_to_task)
> > > >>>> + return false;
> > > >>>> +
> > > >>>> + if (curr->sched_class != p->sched_class)
> > > >>>> + return false;
> > > >>>
> > > >>>
> > > >>> Peter,
> > > >>>
> > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > >>> by Raghu) and return if the skip buddy is already set.
> > > >>
> > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > >> from 81% to 139% using this, right?
> > > >>
> > > >> It might make more sense to keep that separate, outside of this
> > > >> function, since its not a strict prerequisite.
> > > >>
> > > >>>>
> > > >>>> + if (task_running(p_rq, p) || p->state)
> > > >>>> + return false;
> > > >>>> +
> > > >>>> + return true;
> > > >>>> +}
> > > >>
> > > >>
> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > >>> bool preempt)
> > > >>>> rq = this_rq();
> > > >>>>
> > > >>>> again:
> > > >>>> + /* optimistic test to avoid taking locks */
> > > >>>> + if (!__yield_to_candidate(curr, p))
> > > >>>> + goto out_irq;
> > > >>>> +
> > > >>
> > > >> So add something like:
> > > >>
> > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > >> if (p_rq->cfs_rq->skip)
> > > >> goto out_irq;
> > > >>>
> > > >>>
> > > >>>> p_rq = task_rq(p);
> > > >>>> double_rq_lock(rq, p_rq);
> > > >>>
> > > >>>
> > > >> But I do have a question on this optimization though,.. Why do we check
> > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > >>
> > > >> That is, I'd like to see this thing explained a little better.
> > > >>
> > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > >> which might be running on an entirely different cpu (rq) and could
> > > >> succeed?
> > > >
> > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > skip check. Raghu, I am not sure if this is exactly what you want
> > > > implemented in v4.
> > > >
> > >
> > > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > > My intention was: if a directed yield has already happened on a runqueue
> > > (say rqA), do not bother to directed-yield to it. But unfortunately, as
> > > PeterZ pointed out, that would have resulted in setting the next buddy of a
> > > different runqueue than rqA.
> > > So we can drop this "skip" idea. Pondering more over what to do. Can we
> > > use the next buddy itself? ... thinking..
> >
> > As I mentioned earlier today, I did not have your changes from kvm.git
> > tree when I tested my changes. Here are your changes and my changes
> > compared:
> >
> > throughput in MB/sec
> >
> > kvm_vcpu_on_spin changes: 4636 +/- 15.74%
> > yield_to changes: 4515 +/- 12.73%
> >
> > I would be inclined to stick with your changes which are kept in kvm
> > code. I did try both combined, and did not get good results:
> >
> > both changes: 4074 +/- 19.12%
> >
> > So, having both is probably not a good idea. However, I feel like
> > there's more work to be done. With no over-commit (10 VMs), total
> > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
> > overhead, but a reduction to ~4500 is still terrible. By contrast,
> > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > host). We still have what appears to be scalability problems, but now
> > it's not so much in runqueue locks for yield_to(), but now
> > get_pid_task():
> >
>
> Hi Andrew,
> IMHO, reducing the double runqueue lock overhead is a good idea,
> and maybe we will see the benefits when we increase the overcommit further.
>
> The explanation for not seeing a good benefit on top of the PLE handler
> optimization patch is that we filter the yield_to candidates,
> which results in less contention for the double runqueue lock,
> and the extra code overhead during a genuine yield_to might have resulted in
> some degradation in the case you tested.
>
> However, did you use cfs.next also? I hope it helps when we combine.
>
> Here is the result that is showing positive benefit.
> I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
>
> +-----------+-----------+-----------+------------+-----------+
>          kernbench time in sec, lower is better
> +-----------+-----------+-----------+------------+-----------+
>         base      stddev     patched      stddev    %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x    44.3880     1.8699     40.8180      1.9173     8.04271
> 2x    96.7580     4.2787     93.4188      3.5150     3.45108
> +-----------+-----------+-----------+------------+-----------+
>
>
> +-----------+-----------+-----------+------------+-----------+
>          ebizzy records/sec, higher is better
> +-----------+-----------+-----------+------------+-----------+
>         base      stddev     patched      stddev    %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2374.1250    50.9718   3816.2500     54.0681    60.74343
> 2x  2536.2500    93.0403   2789.3750    204.7897     9.98029
> +-----------+-----------+-----------+------------+-----------+
>
>
> Below is the patch which combines PeterZ's suggestions on your
> original approach with the cfs.next check (already posted by Srikar in
> the other thread)

I did get a chance to run with the below patch and your changes in
kvm.git, but the results were not too different:

Dbench, 10 x 16-way VMs on 80-way host (throughput in MB/sec):

kvm_vcpu_on_spin changes:   4636 +/- 15.74%
yield_to changes:           4515 +/- 12.73%
both changes from above:    4074 +/- 19.12%
...plus cfs.next check:     4418 +/- 16.97%

Still hovering around 4500 MB/sec.

The concern I have is that even though we have gone through changes to
help reduce the candidate vcpus we yield to, we still have a very poor
idea of which vcpu really needs to run. The result is high cpu usage in
get_pid_task() and still some contention in the double runqueue lock.
To make this scalable, we either need to significantly reduce the
occurrence of lock-holder preemption, or do a much better job of
knowing which vcpu needs to run (and not unnecessarily yield to vcpus
which do not need to run).

On reducing the occurrence: The worst case for lock-holder preemption
is having vcpus of the same VM on the same runqueue. This guarantees the
situation of 1 vcpu running while another [of the same VM] is not. To
prove the point, I ran the same test, but with vcpus restricted to a
range of host cpus, such that any single VM's vcpus can never be on the
same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
vcpu-1's are on host cpus 5-9, and so on. Here is the result:

kvm_vcpu_on_spin and all
yield_to changes, plus
restricted vcpu placement:  8823 +/- 3.20%   (much, much better)
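
For reference, the restricted placement described above can be sketched
as below. This is only an illustration, not the tooling used for the
test: the helper names vcpu_cpu_range() and ranges_overlap() and the
constants are made up here, and the 80-cpu / 16-vcpu layout is assumed
from the test description.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout from the test above: 80 host cpus, 16 vcpus per VM. */
enum { HOST_CPUS = 80, VCPUS_PER_VM = 16 };

/* Inclusive range of host cpus. */
struct cpu_range {
	int first;
	int last;
};

/*
 * Hypothetical helper: host cpus that vcpu number 'vcpu' of every VM is
 * allowed to run on. Each vcpu index gets its own slice of the host.
 */
static struct cpu_range vcpu_cpu_range(int vcpu)
{
	int width = HOST_CPUS / VCPUS_PER_VM;	/* 5 host cpus per vcpu index */
	struct cpu_range r;

	assert(vcpu >= 0 && vcpu < VCPUS_PER_VM);
	r.first = vcpu * width;
	r.last = r.first + width - 1;
	return r;
}

/* True if the two inclusive ranges share at least one host cpu. */
static bool ranges_overlap(struct cpu_range a, struct cpu_range b)
{
	return a.first <= b.last && b.first <= a.last;
}
```

With these constants vcpu-0 maps to host cpus 0-4 and vcpu-1 to 5-9, and
no two different vcpu indices overlap, so two vcpus of the same VM can
never land on the same runqueue.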

On picking a better vcpu to yield to: I really hesitate to rely on a
paravirt hint [telling us which vcpu is holding a lock], but I am not
sure how else to reduce the candidate vcpus to yield to. I suspect we
are yielding to way more vcpus than are preempted lock-holders, and that
IMO is just work accomplishing nothing. Trying to think of a way to
further reduce candidate vcpus....


-Andrew


>
> ----8<----
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..8551f57 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4820,6 +4820,24 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> +				 struct rq *p_rq)
> +{
> +	if (!curr->sched_class->yield_to_task)
> +		return false;
> +
> +	if (curr->sched_class != p->sched_class)
> +		return false;
> +
> +	if (task_running(p_rq, p) || p->state)
> +		return false;
> +
> +	return true;
> +}
> +
>  /**
>   * yield_to - yield the current processor to another thread in
>   * your thread group, or accelerate that thread toward the
> @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
>  again:
>  	p_rq = task_rq(p);
> +
> +	/* optimistic test to avoid taking locks */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_irq;
> +
> +	/* if next buddy is set, assume yield is in progress */
> +	if (p_rq->cfs.next)
> +		goto out_irq;
> +
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
>  		goto again;
>  	}
>
> -	if (!curr->sched_class->yield_to_task)
> -		goto out;
> -
> -	if (curr->sched_class != p->sched_class)
> -		goto out;
> -
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +	/* validate state, holding p_rq ensures p's state cannot change */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_unlock;
>
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4877,8 +4899,9 @@ again:
>  		rq->skip_clock_update = 0;
>  	}
>
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
>
>  	if (yielded)
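
The shape of the patch above (a cheap unlocked test first, then the same
test repeated once the lock is held) can be sketched in userspace as
below. This is a rough analogy under stated assumptions, not kernel
code: struct target, is_candidate() and try_boost() are hypothetical
names, and a C11 atomic_flag spinlock stands in for the runqueue lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task we might yield to. */
struct target {
	atomic_flag lock;	/* stand-in for the runqueue lock */
	atomic_bool running;	/* already on a cpu */
	atomic_bool runnable;	/* eligible to run */
	int boosts;		/* protected by 'lock' */
};

/* Precondition test; safe to call with or without the lock held. */
static bool is_candidate(struct target *t)
{
	return atomic_load_explicit(&t->runnable, memory_order_relaxed) &&
	       !atomic_load_explicit(&t->running, memory_order_relaxed);
}

static bool try_boost(struct target *t)
{
	bool boosted = false;

	/* Optimistic test: the answer may be stale, it only avoids the lock. */
	if (!is_candidate(t))
		return false;

	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;	/* spin until the lock is ours */

	/* Validate again now that the lock is held. */
	if (is_candidate(t)) {
		t->boosts++;
		boosted = true;
	}

	atomic_flag_clear_explicit(&t->lock, memory_order_release);
	return boosted;
}
```

The unlocked test can only cause an early bail-out; any decision that
changes state is still made under the lock, which is why the second
check cannot be dropped.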

