Re: 2.6.26-rc4: RIP __call_for_each_cic+0x20/0x50

From: Paul E. McKenney
Date: Thu May 29 2008 - 05:18:39 EST


On Thu, May 29, 2008 at 08:42:02AM +0200, Jens Axboe wrote:
> On Thu, May 29 2008, Jens Axboe wrote:
> > > But one additional question...
> > >
> > > static void cfq_cic_free_rcu(struct rcu_head *head)
> > > {
> > > 	struct cfq_io_context *cic;
> > >
> > > 	cic = container_of(head, struct cfq_io_context, rcu_head);
> > >
> > > 	kmem_cache_free(cfq_ioc_pool, cic);
> > > 	elv_ioc_count_dec(ioc_count);
> > >
> > > 	if (ioc_gone && !elv_ioc_count_read(ioc_count))
> > > 		complete(ioc_gone);
> > > }
> > >
> > > Suppose that a pair of tasks both execute the elv_ioc_count_dec()
> > > at the same time, so that all counters are now zero. Both then
> > > find that there is still an ioc_gone, and that the count is
> > > now zero. One of the tasks invokes complete(ioc_gone). This
> > > awakens the corresponding cfq_exit(), which now returns, getting
> > > rid of its stack frame -- and corrupting the all_gone auto variable
> > > that ioc_gone references.
> > >
> > > Now the second task gets a big surprise when it tries to invoke
> > > complete(ioc_gone).
> > >
> > > Or is there something else that I am missing here?
> >
> > No, I think that's a problem spot as well. To my knowledge, nobody has
> > ever hit that. The anticipatory scheduler has the same code.
> >
> > What we want to avoid here is making cfq_cic_free_rcu() a lot more
> > expensive, which is why the elv_ioc_count_read() is behind that
> > ioc_gone check. I'll need to think a bit on how to handle that
> > better :-)
>
> So how about this? Add a spinlock for checking and clearing ioc_gone
> back to NULL. It doesn't matter if we make the ioc_gone != NULL
> case a little more expensive, as it will only happen on cfq-iosched
> module unload. And it seems the clearest way of making this safe.
> The last hunk should really not be necessary, as ioc_gone won't be
> set back to NULL before wait_for_completion() is entered.
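
For reference, here is the double-complete race in the quoted
cfq_cic_free_rcu() spelled out as an interleaving (illustrative only;
both tasks are RCU callbacks):

    Task A                               Task B
    elv_ioc_count_dec()
                                         elv_ioc_count_dec()
    sees ioc_gone != NULL, count == 0
                                         sees ioc_gone != NULL, count == 0
    complete(ioc_gone)
      cfq_exit() returns; all_gone,
      the on-stack completion, is gone
                                         complete(ioc_gone)  <- stale pointer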

Looks better! I do have one scenario that seems troublesome, but it
should be easy to fix; see below. (Assuming it really is a problem,
that is...)

Thanx, Paul

> An identical patch is needed in AS as well.
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index d01b411..32aa367 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -48,6 +48,7 @@ static struct kmem_cache *cfq_ioc_pool;
>
>  static DEFINE_PER_CPU(unsigned long, ioc_count);
>  static struct completion *ioc_gone;
> +static DEFINE_SPINLOCK(ioc_gone_lock);
>
>  #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
>  #define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
> @@ -1177,8 +1178,19 @@ static void cfq_cic_free_rcu(struct rcu_head *head)
>  	kmem_cache_free(cfq_ioc_pool, cic);
>  	elv_ioc_count_dec(ioc_count);
>
> -	if (ioc_gone && !elv_ioc_count_read(ioc_count))
> -		complete(ioc_gone);
> +	if (ioc_gone) {
> +		/*
> +		 * CFQ scheduler is exiting, grab exit lock and check
> +		 * the pending io context count. If it hits zero,
> +		 * complete ioc_gone and set it back to NULL
> +		 */

Suppose that at this point (after the outer ioc_gone check, but before
we acquire ioc_gone_lock) some other CPU does the last complete().
That CPU has set ioc_gone to NULL, so everything is fine so far. But
suppose that in the meantime yet another CPU sets up a cfq instance
and then starts tearing it down. Then ioc_gone would be non-NULL
again, and we would cause this new teardown to end prematurely.

If this is a real problem, one way to get around it is to have a
generation number. We capture this before doing the elv_ioc_count_dec()
(alas, with a memory barrier between the capture and the elv_ioc_count_dec()),
and then check it under the lock. If it has changed, we know someone else
has already done the awakening for us. Increment the generation number
in the same place that ioc_gone is set to NULL.

Seem reasonable?
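
A minimal sketch of that generation-number idea, reusing the
ioc_gone_lock from the patch above (ioc_gone_gen and the exact barrier
placement are illustrative, not part of the patch):

	/* hypothetical: bumped whenever ioc_gone is cleared */
	static unsigned int ioc_gone_gen;

	static void cfq_cic_free_rcu(struct rcu_head *head)
	{
		struct cfq_io_context *cic;
		unsigned int gen;

		cic = container_of(head, struct cfq_io_context, rcu_head);
		kmem_cache_free(cfq_ioc_pool, cic);

		gen = ioc_gone_gen;	/* capture the generation... */
		smp_mb();		/* ...strictly before the dec */
		elv_ioc_count_dec(ioc_count);

		if (ioc_gone && !elv_ioc_count_read(ioc_count)) {
			spin_lock(&ioc_gone_lock);
			/*
			 * A changed generation means someone else already
			 * did the complete() for this teardown; a non-NULL
			 * ioc_gone would then belong to a later unload.
			 */
			if (gen == ioc_gone_gen && ioc_gone &&
			    !elv_ioc_count_read(ioc_count)) {
				complete(ioc_gone);
				ioc_gone = NULL;
				ioc_gone_gen++;
			}
			spin_unlock(&ioc_gone_lock);
		}
	}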

> +		spin_lock(&ioc_gone_lock);
> +		if (ioc_gone && !elv_ioc_count_read(ioc_count)) {
> +			complete(ioc_gone);
> +			ioc_gone = NULL;
> +		}
> +		spin_unlock(&ioc_gone_lock);
> +	}
>  }
>
>  static void cfq_cic_free(struct cfq_io_context *cic)
> @@ -2317,7 +2329,7 @@ static void __exit cfq_exit(void)
>  	 * pending RCU callbacks
>  	 */
>  	if (elv_ioc_count_read(ioc_count))
> -		wait_for_completion(ioc_gone);
> +		wait_for_completion(&all_gone);
>  	cfq_slab_kill();
>  }
>
>
> --
> Jens Axboe
>