Re: PROBLEM: modprobe hang at startup (3.8.x, 3.9.x, IBM x3550)

From: Jean Delvare
Date: Wed May 15 2013 - 15:49:39 EST


Robert,

On Wed, 15 May 2013 21:27:41 +1000, Robert Norris wrote:
> On Wed, May 15, 2013 at 11:20:44AM +0200, Jean Delvare wrote:
> > Can you share the full output of lspci -s 00:1f.3 -vv?
>
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
> Subsystem: IBM Device 02dd
> Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin B routed to IRQ 0

Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
reason for this hang. Was it with the i2c-i801 driver loaded, or
blacklisted? Please check if it makes a difference.
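
For comparing the two cases, the IRQ and the bound driver can also be
read straight from sysfs. The following is only a sketch: the helper
name pci_irq_and_driver is made up for illustration, and the default
path assumes the controller sits at 0000:00:1f.3 as in your lspci
output.

```shell
# Hypothetical helper: report the IRQ and bound driver for a PCI device
# directory in sysfs. The default path is an assumption taken from the
# lspci output quoted above.
pci_irq_and_driver() {
    dev="${1:-/sys/bus/pci/devices/0000:00:1f.3}"
    irq=$(cat "$dev/irq")
    # The "driver" symlink only exists while a driver is bound.
    if [ -L "$dev/driver" ]; then
        drv=$(basename "$(readlink "$dev/driver")")
    else
        drv="none"
    fi
    echo "irq=$irq driver=$drv"
}
```

Running pci_irq_and_driver once with i2c-i801 blacklisted and once with
it loaded should show whether the IRQ number changes when the driver
binds.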

Do you see the same IRQ 0 (and, more generally, this hang) on one,
some, or all of your x3550 servers?

Are you using IPMI on these machines?

> Region 4: I/O ports at 0440 [size=32]
>
> > I'm also curious if the SMBus controller shares its interrupt line
> > with another chip. /proc/interrupts should tell but you'll have to
> > make one of your systems hang again.
>
> I'm not sure how to read it, so here it is (3.9.2, immediately after
> boot, no options to i2c_i801):
>
> CPU0 CPU1 CPU2 CPU3
> (...)
> 20: 0 0 0 0 IO-APIC-fasteoi i801_smbus

Here the IRQ looks correct, and it isn't shared. But I am surprised
that the counters are all 0. If an SMBus transaction had been
attempted, there should be a 1 somewhere, even if the transaction
ultimately failed.
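
To avoid reading the columns by eye, the counters can be summed with a
few lines of awk. This is a rough sketch; the function name
i801_irq_total is made up here, and it assumes the line is labelled
i801_smbus as in the output above.

```shell
# Hypothetical helper: sum the per-CPU counters on the i801_smbus line
# of an interrupts table. The file argument defaults to /proc/interrupts.
i801_irq_total() {
    awk '/i801_smbus/ {
        total = 0
        # Field 1 is the IRQ number ("20:"). The per-CPU counters follow,
        # up to the first non-numeric field (the interrupt controller name).
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        print total
    }' "${1:-/proc/interrupts}"
}
```

A result of 0 after a transaction has been attempted would mean that
the controller never raised its interrupt.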

> (...)
> I went with blacklisting for now because this driver doesn't appear to
> be doing anything useful for us (sensors etc are working without it).
> I'll confess to not really knowing much about its purpose though.

It all depends on what I2C/SMBus slaves are connected to the SMBus.
Often there are the SPD EEPROMs from your memory modules, sometimes
with integrated thermal sensors (on DDR3 only; the driver is jc42). In
your case there is a clock chip as well, for which IBM contributed a
driver.

> > (...)
> > As far as debugging goes, please tell me if you have any I2C/SMBus
> > slave device driver loaded (check in /sys/bus/i2c/drivers.) Loading the
> > i2c-i801 driver doesn't do much on its own if there are no slave device
> > drivers using it.
>
> $ modprobe i2c-i801 disable_features=0x10
> $ dmesg | tail
> ...
> [28876.193408] i801_smbus 0000:00:1f.3: Interrupt disabled by user
> [28876.201168] ics932s401 4-0069: ics932s401 chip found
> $ ls /sys/bus/i2c/drivers
> dummy ics932s401

The dummy driver is a helper stub for i2c-core; it doesn't actually
access the SMBus. ics932s401 is for the clock chip, and I know clock
chips can be tricky and error-prone. OTOH I can only guess that IBM had
a good reason to contribute the driver and make it auto-load on the
x3550.

I would appreciate it if you could test the following:
* Blacklist i2c-i801 and ics932s401 so that neither of them gets
auto-loaded.
* Manually load i2c-i801 with interrupts enabled, and see what happens.
* If no hang happens, load i2c-dev, find the i801 bus number with
i2cdetect -l (from the i2c-tools package - it should be 4 according
to what you reported so far but there is no guarantee that it won't
change across reboots.) Then do a simple read from a random address
with:
# i2cget 4 0x50 0x00
(Adjust the bus number as needed.)
I am curious if this will hang as well or only when accessing the
clock chip at address 0x69.
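
The steps above could be scripted roughly as follows. This is a sketch
under assumptions: the helper name i801_bus_number is made up, and it
assumes that the adapter name printed by i2cdetect -l contains "I801".
The blacklist file name is only an example.

```shell
# Step 1 (config, not run here): put these two lines in a file such as
# /etc/modprobe.d/i801-test.conf (example name), then reboot:
#   blacklist i2c-i801
#   blacklist ics932s401

# Hypothetical helper: print the bus number of the i801 adapter, reading
# "i2cdetect -l" style output on stdin. Assumes the adapter name
# contains "I801" and that the first field looks like "i2c-N".
i801_bus_number() {
    awk '/I801/ { sub(/^i2c-/, "", $1); print $1; exit }'
}

# Steps 2 and 3, to be run as root on the affected machine:
#   modprobe i2c-i801
#   modprobe i2c-dev
#   bus=$(i2cdetect -l | i801_bus_number)
#   i2cget "$bus" 0x50 0x00
```

Looking the bus number up each time avoids relying on it staying at 4
across reboots.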

Thanks,
--
Jean Delvare