RE: [PATCH] acpi/nfit: badrange report spill over to clean range

From: Dan Williams
Date: Tue Jul 12 2022 - 20:48:28 EST


Jane Chu wrote:
> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> granularity") changed the nfit_handle_mce() callback to report a badrange for
> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> instead of the hardcoded L1_CACHE_BYTES. However, recently on a server
> populated with Intel DCPMEM v2 DIMMs, 1UL << MCI_MISC_ADDR_LSB(mce->misc)
> turned out to be 4KiB, or 8 512-byte blocks. Consequently, injecting 2
> back-to-back poisons via ndctl results in 8 poisons being reported.
>
> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
> [..]
> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> [29076.627519] mce: [Hardware Error]: Machine check events logged
> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> [..]
> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
> [..]
> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> [..]
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":33820770304,
> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> "sector_size":512,
> "align":2097152,
> "blockdev":"pmem0",
> "badblock_count":8,
> "badblocks":[
> {
> "offset":8208,
> "length":8,
> "dimms":[
> "nmem0"
> ]
> }
> ]
> }
>
> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator of poison
> radius and shouldn't be used. Moreover, since each injected poison is
> reported independently, any alignment of 512 bytes or less appears to work:
> L1_CACHE_BYTES (though inaccurate), 256 bytes (as ars->length reports),
> or 512 bytes.
>
> To get around this issue, 512 bytes is chosen as the alignment because
> a. it happens to be the badblock granularity,
> b. ndctl inject-error cannot inject more than one poison into a 512-byte block,
> c. it is architecture agnostic.

I am failing to see the kernel bug. Yes, you injected fewer than 8
"badblocks" of poison and the hardware reported 8 blocks of poison, but
that's not the kernel's fault, that's the hardware. What happens when
hardware really does detect 8 blocks of consecutive poison and this
implementation decides to only record 1 at a time?

It seems the fix you want is for the hardware to report the precise
error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
that precision in this case.

However, the ARS engine can likely return the precise error ranges, so I
think the fix is to issue a short ARS scrub request asking the device for
the precise error list, and to use the address range indicated by 1UL <<
MCI_MISC_ADDR_LSB(mce->misc) to filter the results.