Re: mlx4 BUG_ON in probe path

From: Yishai Hadas
Date: Thu Nov 17 2016 - 05:24:29 EST


On 11/16/2016 8:25 PM, Bjorn Helgaas wrote:
> Hi Yishai,
>
> Johannes has been working on an mlx4 initialization problem on an
> IBM x3850 X6. The underlying problem is a PCI core issue -- we're
> setting RCB in the Mellanox device, which means it thinks it can
> generate 128-byte Completions, even though the Root Port above it
> can't handle them. That issue is
> https://bugzilla.kernel.org/show_bug.cgi?id=187781
>
> The machine crashed when this happened, apparently not because of any
> error reported via AER, but because mlx4 contains a BUG_ON, probably
> the one in mlx4_enter_error_state().
>
> That one happens if pci_channel_offline() returns false. Is this
> telling us about a problem in PCI error handling, or is it just a case
> where mlx4 isn't as smart as it could be?

Yes. At that step we expect a problem/bug in the PCI layer that should be fixed (e.g. the channel is reported as online when it is really offline). Can you please evaluate and confirm that?
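
For reference, the condition under discussion can be modeled roughly as below. This is a simplified, self-contained sketch and not the driver source: struct model_dev, model_reset() and model_would_bug() are hypothetical names invented for illustration, and the assumption that the BUG_ON fires on a failed reset attempt is taken only from the description above (it is reached when pci_channel_offline() returns false).

```c
#include <stdbool.h>

/* Hypothetical stand-in for device state; not a kernel structure. */
struct model_dev {
	bool channel_offline;	/* what the PCI core reports for the channel */
	bool reset_succeeds;	/* whether a reset over PCI would really work */
};

/*
 * Models the reset step: a reset is attempted only when the channel is
 * reported online. Returns 0 on success or when the reset is skipped,
 * -1 when the attempted reset fails.
 */
static int model_reset(const struct model_dev *dev)
{
	if (dev->channel_offline)
		return 0;	/* reported offline: reset is skipped */

	return dev->reset_succeeds ? 0 : -1;
}

/* True when the modeled "BUG_ON(err != 0)" condition would fire. */
static bool model_would_bug(const struct model_dev *dev)
{
	return model_reset(dev) != 0;
}
```

In this model the crash path is reachable only when the channel is reported online but the device cannot actually be reset, which is the "reporting online but really is offline" case mentioned above.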