Re: PROBLEM: modprobe hang at startup (3.8.x, 3.9.x, IBM x3550)

From: Martin Mokrejs
Date: Fri May 17 2013 - 05:22:26 EST


Hi,
while you are chasing this problem with i2c-i801, I would like to mention
that I never got an answer on the thread https://lkml.org/lkml/2013/1/23/405
about a kmemleak report from the kernel. Maybe it could give you a hint?
If the two issues do not overlap, I would still be glad to receive an
answer in the original thread I started.
Thank you,
Martin

Jean Delvare wrote:
> Hi Robert,
>
> On Thu, 16 May 2013 13:44:55 +1000, Robert Norris wrote:
>> On Wed, May 15, 2013 at 09:49:23PM +0200, Jean Delvare wrote:
>>>> Interrupt: pin B routed to IRQ 0
>>>
>>> Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
>>> reason for this hang. Was it with the i2c-i801 driver loaded, or
>>> blacklisted? Please check if it makes a difference.
>>
>> That was without the driver loaded (blacklisted). After loading (with
>> interrupts enabled) we get:
>>
>> Interrupt: pin B routed to IRQ 20
>
> For the record, I also see the IRQ value change after loading the
> i2c-i801 driver on my system (with an ICH10 south bridge), from 14 to
> 22 in my case. So it's a bit different (no IRQ 0) but still somewhat
> similar, and I'm not yet sure whether this has anything to do with
> your issue.
>
>>
>>> Do you see the same (and more generally, this issue) on one, some or
>>> all of your x3550 servers?
>>
>> The issue has occurred on at least three x3550s (we have 11). I haven't
>> tested more, because knowingly crashing production machines sucks.
>
> Yes of course, I understand, I did not expect you to do that ;)
>
>> This appears to be the case on other machines. With the module
>> blacklisted (never loaded), lspci shows IRQ 0. After load, IRQ 20.
>> (tested on 3.4 and 3.9).
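The before/after IRQ comparison above can be repeated with a small helper. This is a minimal sketch, assuming the SMBus controller sits at PCI slot 00:1f.3 (taken from the i801_smbus log line quoted in this thread) and that `lspci -vv` prints an "Interrupt: pin X routed to IRQ N" line for it; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the IRQ number from `lspci -vv` output.
# Reads stdin so it can be tried offline on saved output.
irq_from_lspci() {
    sed -n 's/.*Interrupt: pin [A-D] routed to IRQ \([0-9][0-9]*\).*/\1/p'
}

# Assumption: the SMBus controller is at 00:1f.3 (from the i801_smbus
# log line in this thread); pass another slot as $1 on other machines.
smbus_irq() {
    lspci -vv -s "${1:-00:1f.3}" | irq_from_lspci
}
```

Running `smbus_irq` once with i2c-i801 blacklisted and again after `modprobe i2c-i801` should show the 0 versus 20 difference reported here.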
>
> OK.
>
>>> Are you using IPMI on these machines?
>>
>> Yes, but only for monitoring/sensors, if that makes a difference.
>
> IPMI is still likely to access the SMBus controller. If there's a BMC
> in the machine, it can also access the SMBus slave with its own
> controller. It would be good to rule this out by disabling IPMI
> completely, removing the BMC from the machine if it has one, and
> checking if it makes the issue go away or not.
>
>>> I would appreciate if you could test the following:
>>> * Blacklist i2c-i801 and ics932s401 so that none of them get
>>> auto-loaded.
>>
>> Done.
>>
>>> * Manually load i2c-i801 with interrupts enabled, and see what
>>> happens.
>>
>> Returned immediately:
>>
>> [ 60.527140] i801_smbus 0000:00:1f.3: SMBus using PCI Interrupt
>
> This confirms that the i2c-i801 driver loading itself isn't the problem.
>
>>> * If no hang happens, load i2c-dev, find the i801 bus number with
>>> i2cdetect -l (from the i2c-tools package - it should be 4 according
>>> to what you reported so far but there is no guarantee that it won't
>>> change across reboots.)
>>
>> $ i2cdetect -l
>> i2c-0 i2c Radeon i2c bit bus DVI_DDC I2C adapter
>> i2c-1 i2c Radeon i2c bit bus VGA_DDC I2C adapter
>> i2c-2 i2c Radeon i2c bit bus MONID I2C adapter
>> i2c-3 i2c Radeon i2c bit bus CRT2_DDC I2C adapter
>> i2c-4 smbus SMBus I801 adapter at 0440 SMBus adapter
>>
>>> Then do a simple read from a random address
>>> with:
>>> # i2cget 4 0x50 0x00
>>> (Adjust the bus number as needed.)
>>> I am curious if this will hang as well or only when accessing the
>>> clock chip at address 0x69.
>>
>> Yep, that one hangs. The hung task handler picked it up after a few
>> minutes.
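The test sequence above could be scripted roughly as follows. This is only a sketch: the function names and the /etc/modprobe.d/i801-debug.conf file name are hypothetical, it assumes root privileges and the i2c-tools package, and the bus lookup assumes the adapter name contains "I801 adapter" as in the `i2cdetect -l` listing above.

```shell
#!/bin/sh
# Sketch of the manual test sequence; run as root on a test machine only.

# Step 1: keep both drivers from being auto-loaded at boot.
# (Hypothetical file name; modprobe reads any *.conf in /etc/modprobe.d/.)
blacklist_smbus_drivers() {
    printf 'blacklist i2c_i801\nblacklist ics932s401\n' \
        > /etc/modprobe.d/i801-debug.conf
}

# Print the bus number of the first adapter whose name contains
# "I801 adapter", reading `i2cdetect -l` output on stdin.
i801_bus_from_list() {
    awk '/I801 adapter/ { sub(/^i2c-/, "", $1); print $1; exit }'
}

# Steps 2 and 3: load the drivers, look up the bus, try one read.
probe_i801() {
    modprobe i2c-i801 || return 1
    modprobe i2c-dev || return 1
    bus=$(i2cdetect -l | i801_bus_from_list)
    if [ -z "$bus" ]; then
        echo "no I801 adapter found" >&2
        return 1
    fi
    # -y skips i2cget's interactive confirmation prompt.
    i2cget -y "$bus" 0x50 0x00
}
```

Looking the bus number up at run time avoids hard-coding 4, since it may change across reboots. On an affected machine `probe_i801` is expected to hang in the final `i2cget` as reported above, so it should only be run from a shell you can afford to lose.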
>
> OK, this means that any transaction request to the SMBus controller
> causes the hang.
>
> The i2c-i801 driver is optimistically using wait_event() when waiting
> for an interrupt to arrive. I suppose that the interrupt is never
> delivered in your case (all 0 in /proc/interrupts.)
>
> Daniel, shouldn't we use wait_event_timeout() instead to catch issues
> like this and fail cleanly? Maybe even fallback to polling
> automatically?
>
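To check the "all 0 in /proc/interrupts" observation while a transaction is stuck, something like the following could be run from a second shell. It is a sketch that assumes the driver registers its interrupt under the name i801_smbus (the prefix of its log line above) and that the per-CPU counters directly follow the "NN:" column in /proc/interrupts; `irq_count` is a hypothetical name.

```shell
#!/bin/sh
# Sum the per-CPU counts on every /proc/interrupts line mentioning the
# given handler name (default i801_smbus, an assumption; see above).
irq_count() {
    name=${1:-i801_smbus}
    file=${2:-/proc/interrupts}
    awk -v name="$name" '
        index($0, name) {
            # Field 1 is "NN:"; numeric per-CPU counters follow until
            # the first non-numeric field (the interrupt controller type).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        }
        END { print total + 0 }
    ' "$file"
}
```

If `irq_count` stays at 0 while `i2cget` is blocked, the interrupt is never being delivered, which would fit the wait_event() explanation above.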