Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

From: Gleb Natapov
Date: Tue Oct 27 2009 - 11:30:51 EST


On Tue, Oct 27, 2009 at 10:47:49AM -0400, Gregory Haskins wrote:
> Gleb Natapov wrote:
> > On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
> >> Gleb Natapov wrote:
> >>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> >>>> The current code suffers from the following race condition:
> >>>>
> >>>> thread-1 thread-2
> >>>> -----------------------------------------------------------
> >>>>
> >>>> kvm_set_irq() {
> >>>> rcu_read_lock()
> >>>> irq_rt = rcu_dereference(table);
> >>>> rcu_read_unlock();
> >>>>
> >>>> kvm_set_irq_routing() {
> >>>> mutex_lock();
> >>>> irq_rt = table;
> >>>> rcu_assign_pointer();
> >>>> mutex_unlock();
> >>>> synchronize_rcu();
> >>>>
> >>>> kfree(irq_rt);
> >>>>
> >>>> irq_rt->entry->set(); /* bad */
> >>>>
> >>> This is not what happens. irq_rt is never accessed outside read-side
> >>> critical section.
> >> Sorry, I was generalizing to keep the comments short. I figured it
> >> would be clear what I was actually saying, but realize in retrospect
> >> that I was a little ambiguous.
> >>
> > A little is underestimation :) There is not /* bad */ line in the code!
>
> Sorry, that was my own highlighting, not trying to reflect actual code.
>
> >
> >> Yes, irq_rt is not accessed outside the RSCS. However, the entry
> >> pointers stored in the irq_rt->map are, and this is equally problematic
> >> afaict.
> > The pointer is in text and can't disappear without kvm_set_irq()
> > disappearing too.
>
> No, the entry* pointer is .text _AND_ .data, and is subject to standard
> synchronization rules like most other objects.
>
> Unless I am misreading the code, the entry* pointers point to heap
> within the irq_rt pointer. Therefore, the "kfree(irq_rt)" I mention
> above effectively invalidates the entire set of entry* pointers that you
> are caching, and is thus an issue.
>
I think you are missing that the content of the entry is copied, not
pointer to the entry:
irq_set[i++] = *e;
> >
> >> In this particular case we seem to never delete entries at run-time once
> >> they are established. Therefore, while perhaps sloppy, its technically
> >> safe to leave them unprotected from this perspective.
>
> Note: I was wrong in this statement. I forgot that it's not safe at
> run-time either since the entry objects are part of irq_rt.
>
> >> The issue is more
> >> related to shutdown since a kvm_set_irq() caller could be within the
> >> aforementioned race-region and call entry->set() after the guest is
> >> gone. Or did I miss something?
> >>
> > The caller of kvm_set_irq() should hold reference to kvm instance, so it
> > can't disappear while you are inside kvm_set_irq(). RCU protects only
> > kvm->irq_routing not kvm structure itself.
>
> Agreed, but this has nothing to do with protecting the entry* pointers.
>
There are not used outside critical section.

> >
> >>> Data is copied from irq_rt onto the stack and this copy is accessed
> >>> outside critical section.
> >> As mentioned above, I do not believe this really protect us. And even
> > I don't see the prove it doesn't, so I assume it does.
>
> What would you like to see beyond what I've already provided you? I can
> show how the entry pointers are allocated as part of the irq_rt, and I
> can show how the irq_rt (via entry->set) access is racy against
> invalidation.
>
> >
> >> if it did, the copy is just a work-around to avoid sleeping within the
> > It is not a work-around. There was two solutions to the problem one is
> > to call ->set() outside rcu critical section
>
> This is broken afaict without taking additional precautions, such as a
> reference count on the irq_rt structure, but I mentioned this alternate
> solution in my header.
>
> > another is to use SRCU. I
> > decided to use the first one. This way the code is much simpler
>
> "simpler" is debatable, but ok. SRCU is an established pattern
> available in the upstream kernel, so I don't think its particularly
> complicated or controversial to use.
>
> > and I remember asking Paul what are the disadvantages of using SRCU and there
> > was something.
> >
>
> The disadvantages to my knowledge are as follows:
>
> 1) rcu_read_lock is something like 4x faster than srcu_read_lock(), but
> we are talking about nanoseconds on modern hardware (I think Paul quoted
> me 10ns vs 45ns on his rig). I don't think either overhead is something
> to be concerned about in this case.
>
If we can avoid why not? Nanoseconds tend to add up.

> 2) standard rcu supports deferred synchronization (call_rcu()), as well
> as barriers (synchronize_rcu()), whereas SRCU only supports barriers
> (synchronize_srcu()). We only use the barrier type in this code path,
> so that is not an issue.
Agree.

>
> 3) SRCU requires explicit initialization/cleanup, whereas regular RCU
> does not. Trivially solved in my patch since KVM has plenty of
> init/cleanup hook points.
>
No problem here too.

> >> standard RCU RSCS, which is what SRCU is designed for. So rather than
> >> inventing an awkward two-phased stack based solution, it's better to
> >> reuse the provided tools, IMO.
> >>
> >> To flip it around: Is there any reason why an SRCU would not work here,
> >> and thus we were forced to use something like the stack-copy approach?
> >>
> > If SRCU has no disadvantage comparing to RCU why not use it always? :)
>
> No one is debating that SRCU has some disadvantages to RCU, but it
> should also be noted that RCU has disadvantages as well (for instance,
> you can't sleep within the RSCS except for preemptible-based configurations)
>
> The differences between them is really not the issue. The bottom line
> is that upstream KVM irq_routing code is broken afaict with the
> application of RCU alone.
>
> IMO: Its not the tool for the job: At least, not when used alone. You
> either need RCU + reference count (which has more overhead than SRCU due
> to the atomic ops), or SRCU. There may perhaps be other variations on
> this theme, as well, and I am not married to SRCU as the solution, per
> se. But it is *a* solution that I believe works, and IMO its the
> best/cleanest/simplest one at our disposal.
>



--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/