Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

From: Avi Kivity
Date: Mon Sep 24 2012 - 12:20:22 EST


On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
>> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
>> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> >> However, Rik had a genuine concern about the cases where load is not
>> >> equally distributed across runqueues and the lock holder might actually
>> >> be on a different runqueue but not running.
>> >
>> > Load should eventually get distributed equally -- that's what the
>> > load-balancer is for -- so this is a temporary situation.
>>
>> What's the expected latency? This is the whole problem. Eventually the
>> scheduler would pick the lock holder as well; the problem is that
>> scheduling latency is on the millisecond scale while lock hold times are
>> on the microsecond scale, leading to a 1000x slowdown.
>
> Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> something like accurate or fast, never both.
>
>> If we want to yield, we really want to boost someone.
>
> Now if only you knew which someone ;-) This non-modified guest nonsense
> is such a snake pit.. but you know how I feel about all that.

Actually, if in addition to boosting someone I could also unboost myself
enough to be preempted, it wouldn't matter. While boosting the lock
holder is good, the main point is to stop spinning and do useful work
instead. We can detect spinners and avoid boosting them.

That's the motivation for the "donate vruntime" approach I wanted earlier.
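
Something like this, as a rough sketch only -- kvm_for_each_vcpu() is
real, but vcpu_was_spinning() and sched_deboost_current() are made-up
helpers standing in for the spinner detection and the vruntime
donation, and kvm_vcpu_yield_to() for a directed-yield primitive:

static void ple_handler(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me)
			continue;
		/* A vcpu that was itself spinning is almost certainly
		 * not the lock holder; don't boost it. */
		if (vcpu_was_spinning(vcpu))
			continue;
		if (kvm_vcpu_yield_to(vcpu) > 0)
			break;
	}

	/* Donate enough vruntime that the scheduler actually preempts
	 * us, instead of letting us spin out the rest of our slice. */
	sched_deboost_current();
}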

>
>> > We already try to favour the non-running vcpu in this case; that's what
>> > yield_to_task_fair() is about. If it's still not eligible to run, tough
>> > luck.
>>
>> Crazy idea: instead of yielding, just run that other vcpu in the thread
>> that would otherwise spin. I can see about a million objections to this
>> already though.
>
> Yah.. you want me to list a few? :-) It would require synchronization
> with the other cpu to pull its task -- you really want to avoid both
> cpus running it at the same time.

Yeah, it's quite a horrible idea.

>
> Do this at a high enough frequency and you're dead too.
>
> Anyway, you can do this inside the KVM stuff: simply flip the vcpu state
> associated with a vcpu thread and use the preemption notifiers to sort
> things out with the scheduler, or some such.

That's what I thought when I wrote this, but I can't: I might be
preempted in random kvm code, so my state includes the host stack and
registers. Maybe we can special-case the path where we interrupt guest
mode.
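
For reference, the sched-out preemption notifier is roughly where that
state flip would live; a sketch (preempt_notifier_to_vcpu() and
kvm_arch_vcpu_put() exist, vcpu_mark_preempted() is hypothetical):

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/* Only safe to expose vcpu state here if we were preempted
	 * at a clean point, e.g. right after a guest-mode exit, not
	 * somewhere in the middle of arbitrary kvm code. */
	vcpu_mark_preempted(vcpu);
	kvm_arch_vcpu_put(vcpu);
}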

>
>> >> Do you think that instead of using rq->nr_running, we could get a global
>> >> sense of load using avenrun (something like avenrun/num_online_cpus())?
>> >
>> > To what purpose? Also, global stuff is expensive, so you should try and
>> > stay away from it as hard as you possibly can.
>>
>> Spinning is also expensive. How about we do the global stuff every N
>> times, to amortize the cost (and reduce contention)?
>
> Nah, spinning isn't expensive, it's a waste of time -- similar end result
> for someone who wants to do useful work, though not the same cause.
>
> Pick N and I'll come up with a scenario for which it's wrong ;-)

Sure. But if it's rare enough, then that's okay for us.
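
E.g., consult the global average only every Nth PLE exit and cache the
answer in between -- get_avenrun() and num_online_cpus() are real, the
per-cpu counter and the value of N are made up:

static DEFINE_PER_CPU(unsigned int, ple_count);
#define PLE_CHECK_INTERVAL	64	/* the N to be picked apart */

static bool system_overcommitted(void)
{
	static bool overcommitted;
	unsigned long load[3];

	/* Amortize: reuse the cached answer on most exits. */
	if (this_cpu_inc_return(ple_count) % PLE_CHECK_INTERVAL)
		return overcommitted;

	get_avenrun(load, 0, 0);
	/* avenrun is fixed point with FSHIFT fractional bits. */
	overcommitted = load[0] > (num_online_cpus() << FSHIFT);
	return overcommitted;
}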

> Anyway, its an ugly problem and one I really want to contain inside the
> insanity that created it (virt), lets not taint the rest of the kernel
> more than we need to.

Agreed. Though given that postgres and others use userspace spinlocks,
maybe it's not just virt.
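
Their spinlocks have the same shape, roughly (a from-memory sketch,
not postgres's actual code) and the same blind spot when the holder
is preempted:

#include <unistd.h>

typedef volatile int slock_t;

static void s_lock(slock_t *lock)
{
	int spins = 0;

	while (__sync_lock_test_and_set(lock, 1)) {
		/* If the holder got preempted, we burn cpu here until
		 * the backoff kicks in -- same problem as the guest
		 * case, just without the PLE hardware assist. */
		if (++spins > 1000)
			usleep(1000);		/* crude backoff */
		else
			__builtin_ia32_pause();	/* x86 "rep; nop" */
	}
}

static void s_unlock(slock_t *lock)
{
	__sync_lock_release(lock);
}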

--
error compiling committee.c: too many arguments to function