Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

From: Neftin, Sasha
Date: Tue Mar 14 2017 - 03:49:36 EST


On 3/14/2017 03:20, Brown, Aaron F wrote:
From: Bjørn Mork [mailto:bjorn@xxxxxxx]
Sent: Monday, March 13, 2017 9:46 AM
To: Borislav Petkov <bp@xxxxxxxxx>
Cc: Andy Shevchenko <andy.shevchenko@xxxxxxxxx>; lkml@xxxxxxxxxxx;
linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>; vcaputo@xxxxxxxxxxx;
linux-pci@xxxxxxxxxxxxxxx; intel-wired-lan@xxxxxxxxxxxxxxxx; khalidm
<khalidm@xxxxxxxxx>; David Singleton <davsingl@xxxxxxxxx>; Brown, Aaron
F <aaron.f.brown@xxxxxxxxx>; Kirsher, Jeffrey T
<jeffrey.t.kirsher@xxxxxxxxx>
Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

Borislav Petkov <bp@xxxxxxxxx> writes:
On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:

The only change that IMHO matters that happened between v4.10 and v4.11-rc1 is this:
@@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
/* Quiesce the device without resetting the hardware */
e1000e_down(adapter, false);
e1000_free_irq(adapter);
+ e1000e_reset_interrupt_capability(adapter);
}
- e1000e_reset_interrupt_capability(adapter);

So it apparently misses something for the other case, like a
pci_disable_msi() call or so.
Well, lemme add the people from

7e54d9d063fa ("e1000e: driver trying to free already-free irq")

to CC then. :-)
Already did that a week ago:
https://www.spinics.net/lists/netdev/msg423379.html

Haven't heard anything back yet. Wondering if they are waiting for
someone else to submit the pretty obvious revert? Don't understand why
that should take more than a minute to figure out. It's not like they
are testing these changes anyway...
Believe it or not, we actually do test these changes. This one was tested by me, and I did not see the same results that you and the other people reporting this trace did. I made it back to the lab today and have spent a good part of the day attempting to reproduce this bug, without success. Freeze/resume works for me on all the systems I have tried, which include a sampling of all the current parts and many older ones. Given that there are several other reports of this, it is obviously an issue, and I would like to be able to reproduce it in case another patch to resolve the issue this one attempts to fix comes back in another form. So I want to know what is different between the systems that hit this and my bank of systems that don't.

Which exact part (or parts) trigger this (lspci | grep -i eth)? Could it be a difference in .config files? The trace says it is falling back to legacy interrupts; does the system continue to work, and does the network continue to function in that mode? In case it's related to user space, what is the base distro? Any other information you think could help me reproduce the issue would be appreciated.

Thanks,
Aaron


Bjørn
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@xxxxxxxxxxxxxxxx
http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

Hello,

I suggest reverting the commit of this patch. We recommend not applying this change.

Thanks,

Sasha