Re: [PATCH 4/4] ACPI PCI slot detection driver

From: Kenji Kaneshige
Date: Tue Mar 04 2008 - 20:13:22 EST


Hi Alex-san,

Sorry for the delay in responding.

It wasn't the IBM machine that was breaking; it was Fujitsu. They
were returning an error code (the numerical value 1023) when I
called the _SUN method on a slot object that existed in the ACPI
namespace but was not present (as reported by the _STA method).

By the time I got that error report, I'd already dropped the
duplicate name detection code, and was letting the kobject
infrastructure warn about duplicate names because for my test
cases, refcounting was a better solution.

[Kenji-san from Fujitsu seemed to be ok with the progress I'd
made at the time, he can speak up if he's changed his mind ;)]

Unfortunatelly, I have not tried the new version of slot detection
driver because of the lack of test environment. Maybe we need more
several days to wait for test environment.

BTW, does the new one fixes the issue I reported before? I could not
find it in the changelog. IIRC, this issue was difficult to solve
because the root cause of this issue is from the difference of
interpretation of ACPI spec between HP and Fujitsu (I still don't
think it's a good idea to evaluate _SUN for the device object whose
_STA is 0).

Anyway, Fujitsu servers which suffer from this issue doesn't need
the slot detection driver because its PCI slots are all hot-pluggable
and it already has /sys/bus/pci/slots/XXXX directory for all PCI
slots. So one of the workaround is to not use slot detection driver.
But when I tried it before, it was automatically loaded at boot up
time and then I got strange slot named '1023' and many warning
messages (stack traces).

Thanks,
Kenji Kaneshige



Alex Chiang wrote:
* Greg KH <gregkh@xxxxxxx>:
On Tue, Mar 04, 2008 at 10:18:28AM -0800, Jesse Barnes wrote:
On Monday, March 03, 2008 9:49 pm Greg KH wrote:
On Sat, Mar 01, 2008 at 07:43:07AM -0700, Matthew Wilcox wrote:
On Fri, Feb 29, 2008 at 09:25:42PM -0800, Greg KH wrote:
What is the guarantee that the names of these slots are correct and do
not happen to be the same as the hotpluggable ones?
That would be a bug -- and yes, bugs happen, and we have to deal with
them.
My main concern is that BIOS vendors will not fix these bugs, as no
other OS cares/does this kind of thing today. The ammount of bad
information out there might be quite large, and I think this was
confirmed by some initial testing of IBM systems, right?
Yeah, but there's a flip side to this too: if no one uses the data, no one will complain when it's wrong. If Linux starts making it easy to see this stuff, there's a chance system vendors will start taking an extra 5 min. before shipment to make sure that the BIOS info is up to date...

OTOH, I'm not sure which is worse, bad data or no data.
bad data is worse.

And then there's the machines with duplicate slot names, how does this
code handle PCI slots with that? I think some of the IBM machines had
non-hotplug slots named the same as the hotplug slots, right?

At one point, I had some code in there to stick the names of the
slots into a linked list and walk through it to try and detect
duplicate slot names, but after a few iterations, the cases I was
dealing with, it turned out to be easier to refcount them.

[my machines did not have colliding names between hp and non-hp
slots, it was more like seeing the same SxFy object appear
multiple times in the namespace and trying to create them
multiple times.]

It wasn't the IBM machine that was breaking; it was Fujitsu. They
were returning an error code (the numerical value 1023) when I
called the _SUN method on a slot object that existed in the ACPI
namespace but was not present (as reported by the _STA method).

By the time I got that error report, I'd already dropped the
duplicate name detection code, and was letting the kobject
infrastructure warn about duplicate names because for my test
cases, refcounting was a better solution.

[Kenji-san from Fujitsu seemed to be ok with the progress I'd
made at the time, he can speak up if he's changed his mind ;)]

But even that is not the error case you're describing, where
there is clear name collision of two physical slots in the
machine, one being hotplug, the other non-hotplug.

Maybe I would have to add some duplicate name detection code back
in there but...

This stuff needs a _lot_ of testing on a lot of different
machines, and a sane way to fall-back if there are errors to
ensure that working machines don't break.

And then there's the issue with userspace programs only expecting
hotplugable slots in the slots/ directory...

Yes -- totally agreed. And I'd like to see actual examples of
name collisions or userspace breakage to get a better idea of how
to handle real world problems rather than writing some crummy
code based on what my limited imagination can think of.

So how to get this test coverage? -mm? linux-next?

Thanks.

/ac





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/