Re: 2.6.26-rc4: RIP __call_for_each_cic+0x20/0x50

From: Paul E. McKenney
Date: Thu May 29 2008 - 07:26:17 EST


On Thu, May 29, 2008 at 12:13:54PM +0200, Jens Axboe wrote:
> On Thu, May 29 2008, Paul E. McKenney wrote:
> > On Thu, May 29, 2008 at 08:42:02AM +0200, Jens Axboe wrote:
> > > On Thu, May 29 2008, Jens Axboe wrote:
> > > > > But one additional question...
> > > > >
> > > > > static void cfq_cic_free_rcu(struct rcu_head *head)
> > > > > {
> > > > > struct cfq_io_context *cic;
> > > > >
> > > > > cic = container_of(head, struct cfq_io_context, rcu_head);
> > > > >
> > > > > kmem_cache_free(cfq_ioc_pool, cic);
> > > > > elv_ioc_count_dec(ioc_count);
> > > > >
> > > > > if (ioc_gone && !elv_ioc_count_read(ioc_count))
> > > > > complete(ioc_gone);
> > > > > }
> > > > >
> > > > > Suppose that a pair of tasks both execute the elv_ioc_count_dec()
> > > > > at the same time, so that all counters are now zero. Both then
> > > > > find that there is still an ioc_gone, and that the count is
> > > > > now zero. One of the tasks invokes complete(ioc_gone). This
> > > > > awakens the corresponding cfq_exit(), which now returns, getting
> > > > > rid of its stack frame -- and corrupting the all_gone auto variable
> > > > > that ioc_gone references.
> > > > >
> > > > > Now the second task gets a big surprise when it tries to invoke
> > > > > complete(ioc_gone).
> > > > >
> > > > > Or is there something else that I am missing here?
> > > >
> > > > No, I think that's a problem spot as well. To my knowledge, nobody has
> > > > ever hit that. The anticipatory scheduler has the same code.
> > > >
> > > > What we want to avoid here is making cfq_cic_free_rcu() a lot more
> > > > expensive, which is why the elv_ioc_count_read() is behind that
> > > > ioc_gone check. I'll need to think a bit on how to handle that
> > > > better :-)
> > >
> > > So how about this? Add a spinlock for checking and clearing ioc_gone
> > > back to NULL. It doesn't matter if we make the ioc_gone != NULL
> > > case a little more expensive, as it will only happen on cfq-iosched
> > > module unload. And it seems the clearest way of making this safe.
> > > The last hunk should really not be necessary, as ioc_gone wont be
> > > set back to NULL before wait_for_completion() is entered.
> >
> > Looks better! I do have one scenario that seems troublesome, but
> > it should be easy to fix, see below. (Assuming it really is a
> > problem, that is...)
> >
> > Thanx, Paul
> >
> > > An identical patch is needed in AS as well.
> > >
> > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > > index d01b411..32aa367 100644
> > > --- a/block/cfq-iosched.c
> > > +++ b/block/cfq-iosched.c
> > > @@ -48,6 +48,7 @@ static struct kmem_cache *cfq_ioc_pool;
> > >
> > > static DEFINE_PER_CPU(unsigned long, ioc_count);
> > > static struct completion *ioc_gone;
> > > +static DEFINE_SPINLOCK(ioc_gone_lock);
> > >
> > > #define CFQ_PRIO_LISTS IOPRIO_BE_NR
> > > #define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
> > > @@ -1177,8 +1178,19 @@ static void cfq_cic_free_rcu(struct rcu_head *head)
> > > kmem_cache_free(cfq_ioc_pool, cic);
> > > elv_ioc_count_dec(ioc_count);
> > >
> > > - if (ioc_gone && !elv_ioc_count_read(ioc_count))
> > > - complete(ioc_gone);
> > > + if (ioc_gone) {
> > > + /*
> > > + * CFQ scheduler is exiting, grab exit lock and check
> > > + * the pending io context count. If it hits zero,
> > > + * complete ioc_gone and set it back to NULL
> > > + */
> >
> > Suppose that at this point some other CPU does the last complete().
> > They have set ioc_gone to NULL, so everything is fine. But suppose
> > that in the meantime, some other CPU sets up a cfq and then starts
> > tearing it down. Then ioc_gone would be non-NULL, and we would cause
> > this new teardown to end prematurely.
> >
> > If this is a real problem, one way to get around it is to have a
> > generation number. We capture this before doing the elv_ioc_count_dec()
> > (alas, with a memory barrier between the capture and the elv_ioc_count_dec()),
> > and then check it under the lock. If it has changed, we know someone else
> > has already done the awakening for us. Increment the generation number
> > in the same place that ioc_gone is set to NULL.
> >
> > Seem reasonable?
>
> This isn't a problem, since cfq_exit() cannot be called before
> all block queues in the system have been detached from CFQ.

And once all block queues have been detached, no future block queues
can ever be attached again? Or perhaps a better way of putting it,
once CFQ has been shut down, can it be restarted without rebooting
the system? If it can be restarted without reboot, then I do not
yet see how the scenario above is avoided.

> cfq_exit() calls elv_unregister() before setting ioc_gone, so
> when elv_unregister() has returned, CFQ is in its own little world.
> Do we need an smp_wmb() between elv_unregister() and the ioc_gone
> assignment to ensure this ordering as well? IIRC, the spin_lock
> and spin_unlock in elv_unregister() isn't enough to guarentee this.
> We are really down to splitting hairs now, but better safe than
> sorry :-)

I believe that the spinlock takes care of that ordering issue. I am
instead worried about a "Rip Van Winkle" effect where a given task
is delayed at a crucial point. By the time it wakes back up, CFQ
has been not only restarted, but is now in the process of being torn
down again. (Assuming that it can in fact be restarted without
a reboot.)

Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/