Re: [PATCH] ad_sigma_delta: fix race between IRQ and completion

From: Jonathan Cameron
Date: Sat Dec 31 2022 - 09:15:24 EST


On Sat, 24 Dec 2022 15:31:58 +1300
Daniel Beer <dlbeer@xxxxxxxxx> wrote:

> On Fri, Dec 23, 2022 at 04:16:59PM +0000, Jonathan Cameron wrote:
> > > ad_sigma_delta waits for a conversion which terminates with the firing
> > > of a one-shot IRQ handler. In this handler, the interrupt is disabled
> > > and a completion is set.
> > >
> > > Meanwhile, the thread that initiated the conversion is waiting on the
> > > completion to know when the conversion happened. If this wait times out,
> > > the conversion is aborted and IRQs are disabled. But the IRQ may fire
> > > anyway between the time the completion wait times out and the disabling
> > > of interrupts. If this occurs, we get a double-disabled interrupt.
> >
> > Ouch and good work tracking it down. just to check, did you see this
> > bug happen in the wild or spotted by code inspection?
>
> Hi Jonathan,
>
> Thanks for reviewing. It was by inspection -- I'd originally thought
> about it and fixed in in a similar way in this patch:
>
> https://lore.kernel.org/all/61dd3e0c.1c69fb81.cea15.8d98@xxxxxxxxxxxxx/
>
> But since that's not applied, I thought I'd better put together a
> separate fix for the time being.
>
> > Given that timeout generally indicates hardware failure, I'm not sure
> > how critical this is to fix.
>
> Probably not very critical. I think you'd have to be pretty unlucky to
> encounter it.
>
> > Is this fix sufficient? If the interrupt is being handled on a different
> > CPU to the caller of this function, I think we can still race enough that
> > this fails to fix it up. Might need a spinlock to prevent that.
> >
> > CPU 0 CPU 1
> > ad_sd_data_rdy_trig_poll() ad_sd_wait_and_disable()
> >
> > //wait_for_completion ends
> >
> > Interrupt
> > disable_irq()
> > if (sigma-delta->irq_dis) !true
> > else
> > sigma_delta->irq_dis = true
> >
> > disable_irq_nosync(irq)
> > sigma_delta->irq_dis = true;
> >
> > So we still end up with a doubly disabled irq. Add a spinlock to make the
> > disable and the setting of sigma_delta->irq_dis atomic then it should all be fine.
>
> My understanding is that the suffix-less version of disable_irq would
> wait for all running handlers on other CPUs (i.e.
> ad_sd_data_rdy_trig_poll) to finish before proceeding, which would
> prevent this from happening. Is that not the case?

Ah. I'd missed that - it is indeed documented as waiting for
pending irq handlers so that race doesn't exist.


>
> But now that you mention it, there is another small problem: in the case
> where the conversion doesn't time out, the interrupt handler will call
> complete() and then perform some operations on the struct
> ad_sigma_delta.
>
> This is always ok on a single processor, but if there are multiple CPUs
> there is possibly a brief period where both the interrupt handler and
> the waiting thread are accessing the ad_sigma_delta struct without
> synchronization between them.
>
> Not sure if that's really a problem in practice, but I think an easy way
> to rule it out would just be to move the complete() call to the bottom
> of the handler and make sure it doesn't touch the structure again after
> that.

Hmm. At first glance, you are correct that moving completion later probably
makes sense. There may be some subtleties in that ordering though so I definitely want
some feedback from those more familiar with this driver than I am before
taking this patch or that further change.

Jonathan


>
> Cheers,
> Daniel
>