Re: [PATCH 25/30] ACPI / hotplug / PCI: Check for new devices on enabled slots

From: Rafael J. Wysocki
Date: Thu Sep 05 2013 - 07:44:15 EST


On Wednesday, September 04, 2013 10:06:02 PM Alex Williamson wrote:
> On Wed, 2013-09-04 at 21:37 -0600, Alex Williamson wrote:
> > On Thu, 2013-09-05 at 01:35 +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, September 04, 2013 05:12:14 PM Alex Williamson wrote:
> > > > On Thu, 2013-09-05 at 00:54 +0200, Rafael J. Wysocki wrote:
> > > > > On Wednesday, September 04, 2013 02:36:34 PM Alex Williamson wrote:
> > > > > > On Thu, 2013-07-18 at 01:32 +0200, Rafael J. Wysocki wrote:
> > > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> > > > > > >
> > > > > > > The current implementation of acpiphp_check_bridge() is pretty dumb:
> > > > > > > - It enables a slot if it's not enabled and the slot status is
> > > > > > > ACPI_STA_ALL.
> > > > > > > - It disables a slot if it's enabled and the slot status is not
> > > > > > > ACPI_STA_ALL.
> > > > > > >
> > > > > > > This behavior is not sufficient to handle the Thunderbolt daisy
> > > > > > > chaining case properly, however, because in that case the bus
> > > > > > > behind the already enabled slot needs to be rescanned for new
> > > > > > > devices.
> > > > > > >
> > > > > > > For this reason, modify acpiphp_check_bridge() so that slots are
> > > > > > > disabled and stopped if they are not in the ACPI_STA_ALL state.
> > > > > > >
> > > > > > > For slots in the ACPI_STA_ALL state, devices behind them that don't
> > > > > > > respond are trimmed using a new function, trim_stale_devices(),
> > > > > > > introduced specifically for this purpose. That function walks
> > > > > > > the given bus and checks each device on it. If the device doesn't
> > > > > > > respond, it is assumed to be gone and is removed.
> > > > > > >
> > > > > > > Once all of the stale devices directy behind the slot have been
> > > > > > > removed, acpiphp_check_bridge() will start looking for new devices
> > > > > > > that might have appeared on the given bus. It will do that even if
> > > > > > > the slot is already enabled (SLOT_ENABLED is set for it).
> > > > > > >
> > > > > > > In addition to that, make the bus check notification ignore
> > > > > > > SLOT_ENABLED and go for enable_device() directly if bridge is NULL,
> > > > > > > so that devices behind the slot are re-enumerated in that case too.
> > > > > > >
> > > > > > > This change is based on earlier patches from Kirill A Shutemov
> > > > > > > and Mika Westerberg.
> > > > > > >
> > > > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> > > > > > > Tested-by: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>
> > > > > > > ---
> > > > > >
> > > > > > FYI, git bisect landed on this patch as the cause of my serial console
> > > > > > dying on current upstream. Further debugging to come... Thanks,
> > > > >
> > > > > Well, sorry about that.
> > > > >
> > > > > What exactly do you mean by "dying"?
> > > >
> > > > Sorry, I was hoping to have more details quickly, but it's been a pain
> > > > to debug. By dying I mean serial console output suddenly stops during
> > > > kernel boot and nothing more comes out of it until after the system is
> > > > rebooted. The problem happens when acpiphp_check_bridge() calls
> > > > enable_slot(). The serial console dies somewhere down in
> > > > acpiphp_bus_trim(). I think this is happening on the 00:1f ISA bridge,
> > > > so there's a good chance the serial ports are described as somewhere
> > > > under there.
> > >
> > > Can you please check if that is the acpiphp_bus_trim() called by
> > > acpiphp_bus_add() or the other one called from trim_stale_devices()?
> > >
> > > Just add a dump_stack() or WARN_ON(1) to trim_stale_devices() next to
> > > the acpiphp_bus_trim() call and see if that triggers. I *think* it's the one
> > > in acpiphp_bus_add(), but it won't hurt to verify that.
> >
> > Here's the call path:
> >
> > [ 16.120824] [<ffffffff81627e6c>] dump_stack+0x55/0x76
> > [ 16.125979] [<ffffffff8162132e>] enable_slot+0x4ee/0x5e0
> > [ 16.131396] [<ffffffff813418fb>] ? trim_stale_devices+0x5b/0xf0
> > [ 16.137420] [<ffffffff81341b35>] acpiphp_check_bridge+0xd5/0x110
> > [ 16.143531] [<ffffffff81342acb>] hotplug_event+0x16b/0x260
> > [ 16.149115] [<ffffffff81072cd9>] ? process_one_work+0x189/0x540
> > [ 16.155136] [<ffffffff81342bf0>] hotplug_event_work+0x30/0x70
> > [ 16.160978] [<ffffffff81072d3b>] process_one_work+0x1eb/0x540
> > [ 16.166819] [<ffffffff81072cd9>] ? process_one_work+0x189/0x540
> > [ 16.172836] [<ffffffff8107353c>] worker_thread+0x11c/0x370
> > [ 16.178426] [<ffffffff81073420>] ? rescuer_thread+0x350/0x350
> > [ 16.184276] [<ffffffff8107b0ea>] kthread+0xea/0xf0
> > [ 16.189165] [<ffffffff8107b000>] ? kthread_create_on_node+0x160/0x160
> > [ 16.195700] [<ffffffff816395dc>] ret_from_fork+0x7c/0xb0
> > [ 16.201109] [<ffffffff8107b000>] ? kthread_create_on_node+0x160/0x160
> >
> > The actual death of the serial console occurs in acpi_device_set_power()
> > called from:
> >
> > enable_slot()
> > acpiphp_bus_add()
> > acpiphp_bus_trim()
> > acpi_bus_trim()
> > acpi_walk_namespace()
> > acpi_bus_remove()
> > acpi_device_unregister()
> > acpi_device_set_power()
> >
> > I can't seem to get a path from the acpi devices in question there, so I
> > have no idea what's getting trimmed here. It worries me quite a bit by
> > introducing this trimming that apparently wasn't happening before
> > though. Thanks,
>
> As suspected, the pnp.bus_id/id of the last device before the serial
> console dies is COM1/PNP0501. I also see all of these being trimmed
> out:
>
> MBRD/PNP0C02
> DMAC/PNP0200
> MATH/PNP0C04
> PIC/PNP0000
> HPET/PNP0103
> RTC/PNP0B00
> SPKR/PNP0800
> TIME/PNP0100
> LNK{A-H}/PNP0C0F
>
> This seems like a bad idea. I forgot to mention, the original
> hotplug_event is called with a device check on \_SB_.PCI0.PEX2. The box
> where I'm seeing this is a pretty generic X58 based Nehalem workstation
> (Lenovo S20).

Thanks for the info!

The acpiphp_bus_trim() in acpiphp_bus_add() is a leftover and a bad one.
I don't think it's actually necessary, at least Thunderbolt works without
it just fine on my Aspire S5.

The patch below should help, can you please test it?

Rafael

---
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
Subject: ACPI / hotplug / PCI: Don't trim devices before scanning the namespace

In acpiphp_bus_add() we first remove device objects corresponding to
the given handle and the ACPI namespace branch below it which are
then re-created by acpi_bus_scan(). This used to be done to clean
up after surprise removals, but now we do the cleanup through
trim_stale_devices() which checks if the devices in question are
actually gone before removing them, so the device hierarchy trimming
in acpiphp_bus_add() is not necessary any more and, moreover, it may
lead to problems if it removes device objects corresponding to
devices that are actually present.

Reported-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
---
drivers/pci/hotplug/acpiphp_glue.c | 1 -
1 file changed, 1 deletion(-)

Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -487,7 +487,6 @@ static void acpiphp_bus_add(acpi_handle
{
struct acpi_device *adev = NULL;

- acpiphp_bus_trim(handle);
acpi_bus_scan(handle);
acpi_bus_get_device(handle, &adev);
if (adev)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/