Re: 3.16-rcX crashes on resume from Suspend-To-RAM

From: Markus Gutschke
Date: Sat Aug 16 2014 - 05:55:23 EST


I collected all the data that you asked for and attached it to the
bug: https://bugzilla.kernel.org/show_bug.cgi?id=80911

Yes, both the acpidump output and the list of PNP devices change when I
update the kernel. I was hoping to give you a brief "diff" output for
the changes; but there are too many changes for that to make much
sense. In any case, you can see it by running:

diff -u last-good/dirtree_\!sys\!devices\!pnp0.txt \
        first-bad/dirtree_\!sys\!devices\!pnp0.txt

I included a README.txt that describes the contents of all the files.
I hope this makes some sense and I hope it is sufficiently complete
for you to make progress in debugging why my machine is unhappy.
Please don't hesitate to ask, if you think I can provide other data
and/or run other tests.


Markus


On Fri, Aug 15, 2014 at 5:46 PM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> On Friday, August 15, 2014 10:17:42 AM Markus Gutschke wrote:
>> Just wondering if any of you had any other ideas of what I could try
>> to help debug this problem?
>
> My theory is that there is a device in your system that we don't have a driver
> for, but it had been enumerated as a PNP device before the change that triggered
> the problem for you and we turned it off during suspend as part of the default
> ACPI PNP device handling.
>
> The reason why you're seeing a crash with the "platform" test level is most
> likely that the _WAK control method does something unusual on your system.
>
> The LNXSYBUS:00 thing from dmesg probably is a red herring.
>
> I need the output of acpidump from the affected system, but please attach it
> to the bug entry at https://bugzilla.kernel.org/show_bug.cgi?id=80911 that
> Rui has created for this issue.
>
> Also please check the list of PNP devices under
>
> /sys/bus/pnp/devices/
>
> before and after the commit you have found by bisection and let me know if
> there are any differences.
>
>
>> On Tue, Aug 12, 2014 at 9:11 AM, Markus Gutschke <markus@xxxxxxxxxxxx> wrote:
>> > As I said earlier in this thread, echo'ing "devices" into "pm_test"
>> > does not result in a crash; but doing so for "platform" does.
>> >
>> > Markus
>> >
>> > On Aug 12, 2014 1:26 AM, "Zhang Rui" <rui.zhang@xxxxxxxxx> wrote:
>> >>
>> >> On Sat, 2014-08-09 at 03:14 -0700, Markus Gutschke wrote:
>> >> > I am back and have physical access to the machine now.
>> >> >
>> >> great!
>> >>
>> >> > I re-ran the test just to be sure, and I can confirm that "platform"
>> >> > does in fact result in a crash.
>> >> >
>> >> what about "devices"?
>> >> I mean
>> >>
>> >> # echo devices > /sys/power/pm_test
>> >>
>> >> and see if that triggers the crash.
>> >>
>> >> > Furthermore, I ran the test that Rui asked for. I suspended, resumed,
>> >> > and upon crashing power-cycled the machine ASAP. "dmesg" suggests that
>> >> > the problem is with LNXSYBUS:00. That doesn't tell me much, but
>> >> > hopefully it makes sense to you guys.
>> >> >
>> >> [ 0.930093] Magic number: 10:810:122
>> >> [ 0.930185] acpi LNXSYBUS:00: hash matches
>> >>
>> >> This looks weird, ACPI will do nothing for LNXSYBUS devices during
>> >> resume.
>> >> Rafael, any thought on this?
>> >>
>> >> thanks,
>> >> rui
>> >>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.

This archive contains debugging information retrieved for the following kernel
versions:

455c6fdbd219161bd09b1165f11699d6d73de11c: Linux 3.14
1860e379875dfe7271c649058aeddffe5afd9d0d: Linux 3.15
aca0a4eb4e325914ddb22a8ed06fcb0222da2a26: Last good commit
eec15edbb0e14485998635ea7c62e30911b465f0: First bad commit
b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f: Still bad (merge branch "acpi-enumeration")
b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f.PATCHED: Patched with a debug patch provided by Rui

After eec15edbb0e14485998635ea7c62e30911b465f0, the kernel can no longer resume
from suspend. The crash looks like what is shown in "crash.png".

Even for defective kernels, "echo devices >/sys/power/pm_test" completes successfully,
whereas "echo platform >/sys/power/pm_test" triggers the crash.
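
For reference, one pm_test run can be sketched roughly as below. This is only an illustration under a few assumptions: the function name run_pm_test and the POWER_DIR override are made up for this example, the kernel needs CONFIG_PM_DEBUG, and the writes need root. It is not necessarily the exact sequence that was typed on the affected machine.

```shell
#!/bin/sh
# POWER_DIR defaults to the real sysfs location; the override is a
# hypothetical convenience for dry runs against an ordinary directory.
POWER_DIR="${POWER_DIR:-/sys/power}"

run_pm_test() {
    level="$1"   # "devices" or "platform" for the tests described here

    # Select how far the suspend sequence goes before it is rolled back.
    echo "$level" > "$POWER_DIR/pm_test" || return 1

    # Start a suspend-to-RAM cycle; in test mode the kernel stops at the
    # selected level and resumes on its own.
    echo mem > "$POWER_DIR/state" || return 1

    # Switch test mode off again so that later suspends behave normally.
    echo none > "$POWER_DIR/pm_test"
}
```

Calling "run_pm_test devices" and then "run_pm_test platform" reproduces the two cases compared above.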

CONFIG_PM_TRACE_RTC suggests that the bug is caused by LNXSYBUS:00, but that might be
incorrect.

Please note that "dmesg" shows an early stack trace during boot. This might or might
not be related.

Please also note that the output from "acpidump" changes between kernel versions.

I also included the output from
grep . /sys/bus/pnp/devices/*/firmware_node/*
grep . /sys/bus/pnp/devices/*/*
grep . /sys/bus/platform/devices/*/firmware_node/*
grep . /sys/bus/platform/devices/*/*
for each of the kernels.

I also included the directory tree underneath /sys/bus/pnp/devices/ -> /sys/devices/pnp0/.
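
To repeat the attribute dumps on another kernel, something along these lines should work. It is a sketch only: the function name dump_sysfs_attrs, the SYSFS_ROOT override and the output file names are invented for this example and do not match the file names used in this archive.

```shell
#!/bin/sh
# SYSFS_ROOT defaults to /sys; the override is a hypothetical convenience
# for running the helper against a copy of the tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

dump_sysfs_attrs() {
    bus="$1"      # e.g. pnp or platform
    outdir="$2"   # e.g. last-good or first-bad

    mkdir -p "$outdir" || return 1

    # "grep ." prints every non-empty line, prefixed with the file name when
    # several files are given. Complaints about directories and unreadable
    # attributes go to /dev/null, and a non-zero exit from grep is ignored.
    grep . "$SYSFS_ROOT"/bus/"$bus"/devices/*/* \
        > "$outdir/$bus.txt" 2>/dev/null || true
    grep . "$SYSFS_ROOT"/bus/"$bus"/devices/*/firmware_node/* \
        > "$outdir/${bus}_firmware_node.txt" 2>/dev/null || true
}
```

Running "dump_sysfs_attrs pnp last-good" on one kernel and "dump_sysfs_attrs pnp first-bad" on the other gives two directories that can be compared with "diff -ru last-good first-bad".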