RE: Problems with EDAC coexisting with BIOS
From: Gross, Mark
Date: Fri Apr 21 2006 - 18:21:25 EST
>From: Doug Thompson [mailto:dthompson@xxxxxxxx]
>Sent: Friday, April 21, 2006 2:43 PM
>To: Gross, Mark
>Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>devel@xxxxxxxxxxxxxxxxxxxxx; Doug Thompson;
>Subject: RE: Problems with EDAC coexisting with BIOS
>On Fri, 2006-04-21 at 21:32 +0000, "Gross, Mark" wrote:
>> >-----Original Message-----
>> >From: Doug Thompson [mailto:dthompson@xxxxxxxx]
>> >Sent: Friday, April 21, 2006 1:57 PM
>> >To: Gross, Mark
>> >Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>> >devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
>> >Subject: Re: Problems with EDAC coexisting with BIOS
>> >Mark thanks for the informaton on this.
>> >Now the e752x_edac.c driver contains no direct calls to panic within
>> >itself. The edac_mc.c 'core' piece does, but only calls that if an
>> >error is found and panic on UE is enabled. (The other is a PCI
>> >panic, but doesn't effect this path). It might be possible that
>> >the hidden register was now hidden, the retrieval function returns
>> >garbage which falsely triggers the panic by the core.
>> The problem is that once the BIOS SMI re-hides the device reading
>> the chipset error registers returns 0xFFFFFFFF.
>Ok so the panic is caused by the driver because the UE bit is a one
>(somewhere in the 32-bit field), and then informs the CORE driver of
>false positive UE event, which does the panic trigger. I follow the
>The question then comes up: how to determine if the system has the
>hidden register feature.
For the E752x chipsets, if the Dev0:Fun1 is hidden after the BIOS hands
control to the OS, I think the driver should simply honor this and not
The test is to check if bit 5 of the DEVPRES1 resister is set. If set
the device is available, otherwise its hidden. (see section 3.5.41 of
>You mentioned that the probe should be refactored to look for this all
>ones return value as such an indicator, is that correct?
Yes, basically we should consider removing the un-hide code in
e752x_probe1, and handle the case where the dev0:fun1 is left hidden by
the BIOS as an error condition at driver probe time.
>But there is a window that the register is unhidden, the probe runs,
>detects the hardware it is looking at and registers itself. Then later
>the window closes, the register is hidden again and the panic can
>reoccurr as above. Thus, a check needs to be made in the poll operation
>for the all ones return value, notify the log of this situtation,
>shutdown future polling of this hardware and basicly become a no-op
You can never predict when a SMI will bubble through the system. Even
if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
do you deal with the race between the BIOS SMI based handling and the
driver? Who will end up reading (and clearing) the error registers
first? There is no good way to share today.
>Does that fix cover the issue as you have discovered it? Or is there
>more, or another fix you see?
No, you should return an error from e752x_probe1 if bit5 is cleared
already. To do anything else you are assuming that you are privileged
to have more knowledge about what the BIOS is doing than you should.
I'll send the patch to e752x_edac.c this weekend.
>> >I would like to see the panic output and stack trace to see where
>> >panic came from. It might have come from the PCI device access
>> >when it was trying to access the now hidden register.
>> >can you post that panice information?
>> Yes, but the edac_mc.c call to panic doesn't drop any data to the
>> The following is the output from an instrumentation we did to the
>> that went into RHEL4-U3 :
>> INIT: version 2.85 booting
>> Welcome to Red Hat Enterprise Linux AS
>> Press 'I' to enter interactive startup.
>> Starting udev: [ OK ]
>> Initializing hardware... storage network audioIn probe1, before
>> E752X_DEVPRES1 = 0x10
>> In probe1, after write
>> E752X_DEVPRES1 = 0x30
>> Snip .....
>> Configuring kernel parameters: [ OK ]
>> Setting clock (localtime): Sun Mar 12 00:52:17 PST 2006 [ OK ]
>> Setting hostname localhost.localdomain: [ OK ]
>> Checking root filesystem
>> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
>> /: clean, 365954/8830976 files, 2357630/17657443 blocks
>> [ OK ]
>> Remounting root filesystem in read-write mode: [ OK ]
>> No devices found
>> No Software RAID disks
>> Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
>> ment: E752X_DEVPRES1 = 0x10
>> Kernel panic - not syncing: MC0: Uncorrected Error
>> that's all you get, no stack trace.
>> The EX752C_DEVPRES1 = ... debug statements came from sprinkling call
>> the following debug function we instrumented into the EDAC driver
>> static void print_error_info(struct pci_dev *pdev)
>> u8 stat8;
>> pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
>> printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/