Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work

From: Jarek Poplawski
Date: Thu Apr 26 2007 - 08:53:48 EST


On Wed, Apr 25, 2007 at 06:47:59PM +0400, Oleg Nesterov wrote:
> On 04/25, Oleg Nesterov wrote:
> >
> > On 04/25, Jarek Poplawski wrote:
> > >
> > > Probably this is also possible without timer i.e.
> > > with queue_work.
> >
> > Yes, thanks. While adding cpu-hotplug check I forgot to add ->current_work
> > check, which is needed to actually implement this
> >
> > > > Note that cancel_rearming_delayed_work() now can handle the works
> > > > which re-arm itself via queue_work(), not only queue_delayed_work().
> >
> > part. I'll resend after fix.
>
> Hm. But can't we do better? Looks like we don't need to check ->current_work,
>
> void cancel_rearming_delayed_work(struct delayed_work *dwork)
> {
> struct work_struct *work = &dwork->work;
> struct cpu_workqueue_struct *cwq = get_wq_data(work);
> int done;

I don't understand, why you think cwq cannot be NULL here.

>
> do {
> done = 1;
> spin_lock_irq(&cwq->lock);
>
> if (!list_empty(&work->entry))
> list_del_init(&work->entry);

BTW, isn't needs_a_good_name needles after this and after del_timer positive?

> else if (test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
> done = del_timer(&dwork->timer)

If this runs while a work function is fired in run_workqueue,
it sets _PENDING bit, but if the work skips rearming, we have probably
endless loop, again.

>
> spin_unlock_irq(&cwq->lock);
> } while (!done);
>
> /*
> * Nobody can clear WORK_STRUCT_PENDING. This means that the
> * work can't be re-queued and the timer can't be re-started.
> */
> needs_a_good_name(cwq->wq, work);
> work_clear_pending(work);
> }
>
> Jarek, I didn't think much about this, just a new idea. I am posting this code
> in a hope you can review it while I sleep on this... CPU-hotplug is ignored for
> now. Note that this version doesn't need the change in run_workqueue().

It's very interesting proposal, but also hard to analyse - the locks
are taken and given away, and there is hard to forsee, when and where
the loop regains the lock again. It is something alike to the current
way, with some added measures: you try to shoot a work on the run,
while queued or timer_pending, plus the _PENDING flag set, so it seems,
there is some risk of longer than planed looping.

I have to look at this more, at home and, if something new, I'll write
tomorrow. So, the good news, is you should have enough sleep this time!

Cheers,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/