Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Hidetoshi Seto
Date: Thu Mar 01 2012 - 23:03:21 EST


(2012/03/02 3:28), Luck, Tony wrote:
>>> My concern is; on Sandy Bridge, is it safe to gather info about the DIMM
>>> location in/from machine check context in a reasonable time span?
>>
>> Well, what amd64_edac does is "buffer" the required lookup info so
>> whenever you get an error, you simply lookup the channel and chip select
>> - all ops which can be done in atomic context.
>
> Yes - we could pre-read all the config space registers ahead of time and
> save them in memory (none of the values should change - except if the platform
> supports hot-plug for memory). Total is only a few Kbytes. Then decode in
> machine check context is both safe, and fast.

To sort out my thoughts:

- First of all, the OS gathers info about the physical location of DIMMs
  from DMI/ACPI/PCI etc., before enabling the MCE mechanism.
- Build a kind of "physical memory location table" in a memory buffer,
  to ease mapping a physical address to the location of the DIMM module
  and/or chip that contains the memory cell at that address.
- It would be better to have a well-organized table rather than
  a raw copy of config space etc.
- Likewise, it would also be nice if we could map logical processor
  numbers to the location of physical sockets on the motherboard.
- It would be nice if the user could refer to the table via sysfs.
- Allow updating the table if the platform supports hot-plug.
- Once MCE is enabled, the handler can refer to the in-memory table to
  determine the faulty device that should be replaced.
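
As a rough illustration of the table idea above, here is a minimal
userspace sketch in C. Everything in it is hypothetical: the struct
dimm_loc layout, the loc_table contents and dimm_loc_lookup() are
made-up names, not existing kernel interfaces. It also assumes each DIMM
covers one contiguous physical address range, which is not true on real
hardware with channel or socket interleaving. The point is only that a
lookup over a prebuilt, sorted table needs no locks, no allocation and
no config space access, so it could run in machine check context.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: one contiguous physical range owned by one DIMM. */
struct dimm_loc {
	uint64_t start;		/* first physical address covered */
	uint64_t end;		/* last physical address covered (inclusive) */
	uint8_t  socket;	/* physical socket on the motherboard */
	uint8_t  channel;	/* memory channel within the socket */
	uint8_t  slot;		/* DIMM slot within the channel */
};

/* Example data only; built once and sorted by start before MCE is enabled. */
static const struct dimm_loc loc_table[] = {
	{ 0x000000000ULL, 0x0FFFFFFFFULL, 0, 0, 0 },
	{ 0x100000000ULL, 0x1FFFFFFFFULL, 0, 1, 0 },
	{ 0x200000000ULL, 0x2FFFFFFFFULL, 1, 0, 0 },
};

#define LOC_TABLE_LEN (sizeof(loc_table) / sizeof(loc_table[0]))

/*
 * Binary search over the sorted table.
 * Returns the index of the entry covering paddr, or -1 if none does.
 */
static int dimm_loc_lookup(uint64_t paddr)
{
	size_t lo = 0, hi = LOC_TABLE_LEN;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (paddr < loc_table[mid].start)
			hi = mid;
		else if (paddr > loc_table[mid].end)
			lo = mid + 1;
		else
			return (int)mid;
	}
	return -1;
}
```

The returned index is the kind of small binary value a handler could
report, while the full entry stays available for userland to read.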

The storyline up to here is reasonable and acceptable, I think.

So now it is clear that the one remaining point I feel uneasy about is
putting a string into the ring buffer instead of binary data, such as an
index into the location table. Please use binary (or "binary + string")
to tell the error location to userland.
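
To make the "binary + string" request concrete, here is a minimal
userspace sketch in C of what such a record could look like. The
struct mce_loc_record layout, its field sizes and the helper
mce_loc_record_fill() are all hypothetical, invented for illustration;
this is not the layout of the actual MCE tracepoint. The binary index is
what tools would parse; the string is optional and only for humans.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record: binary fields first, optional text last. */
struct mce_loc_record {
	uint64_t addr;		/* physical address reported by the error */
	int32_t  loc_index;	/* index into a location table, -1 if unknown */
	char     msg[32];	/* optional human-readable hint, may be empty */
};

/*
 * Fill a record. The string is copied with truncation and is always
 * NUL-terminated; passing NULL for msg leaves it empty.
 */
static void mce_loc_record_fill(struct mce_loc_record *rec, uint64_t addr,
				int32_t loc_index, const char *msg)
{
	rec->addr = addr;
	rec->loc_index = loc_index;
	snprintf(rec->msg, sizeof(rec->msg), "%s", msg ? msg : "");
}
```

With a record like this, userland can compare loc_index values directly
and does not depend on the exact wording of the string.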


Thanks,
H.Seto
