Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)

From: James Bottomley
Date: Sat Mar 22 2014 - 16:49:11 EST


On Sat, 2014-03-22 at 20:25 +0100, Oleg Nesterov wrote:
> On 03/22, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > Tetsuo, what do you think?
> >
> > I don't like blocking SIGKILL while doing operations that depend on memory
> > allocation by other processes. If the OOM killer is triggered and it chose
> > the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
> > it generates the OOM killer deadlock.
>
> It seems that we do not understand each other.
>
> I do not like this hack too. And it is even wrong, you can't really block
> SIGKILL. And it is unnecessary in a sense that (I think) it is fine that
> module_init() reacts to SIGKILL and aborts, just the fact it crashes the
> kernel in the error paths is not fine.
>
> The driver should be fixed anyway. As for timeout, either userspace/systemd
> should be changed to not send SIGKILL after 30 secs, or (better) the driver
> should be changed to not waste 30 secs.
>
> The hack I sent can only serve as a short term solution, it should be
> reverted once we have something better. And, otoh, this hack only affects
> the problematic driver which should be fixed in any case.
>
> Do you see my point? I can be wrong, but I think that you constantly
> misunderstand the intent.
>
> > My preference is to fix kthread_create() to handle SIGKILL gracefully.
>
> And now I do not understand you too. I do not understand why should we
> "fix" kthread_create().
>
> > Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> > ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> > commit 786235ee sounds like a kernel API breakage.
>
> Personally I do not really think so, but OK. In this case lets revert
> 786235ee.
>
> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures because systemd-udevd receives SIGKILL due to timeout
> > while loading SCSI controller drivers using finit_module() [1].
>
> And I still think that 786235ee simply uncovered the problems in this
> driver. Perhaps we should change kthread_create() for some reason, but
> (imho) not because we need to help the buggy code.
>
> > Therefore, this patch changes kthread_create() from "wait forever in
> > killable state" to "wait for 10 seconds in unkillable state, check for
> > the OOM killer every second".
>
> Personally I dislike this change. In fact I think it is ugly. But this
> is only my opinion.
>
> If you convince someone to take this patch I won't argue.

OK, the fix from the SCSI point of view is to make the mpt teardown path
actually work. The whole thing looks to be a complete mess because
there's another place where it will do the wrong thing in
mptscsih_remove(). You always have to call mpt_detach() otherwise the
device doesn't get removed from the lists. In theory this patch fixes
both bugs in the driver.

James

---

diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c
index 727819c..282d39a 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1176,10 +1176,14 @@ mptscsih_remove(struct pci_dev *pdev)
MPT_SCSI_HOST *hd;
int sz1;

+ if (!host)
+ /* not brought up far enough to do scsi_host_attach() */
+ goto out;
+
scsi_remove_host(host);

if((hd = shost_priv(host)) == NULL)
- return;
+ goto out;

mptscsih_shutdown(pdev);

@@ -1203,6 +1207,7 @@ mptscsih_remove(struct pci_dev *pdev)

scsi_host_put(host);

+ out:
mpt_detach(pdev);

}


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/