Re: Phantom PMEM poison issue

From: Jane Chu
Date: Fri Jan 21 2022 - 19:40:36 EST


On 1/21/2022 4:31 PM, Jane Chu wrote:
> On a bare-metal Intel platform with DCPMEM installed and configured to
> provision a daxfs, say a poison is consumed by a load from a user
> thread; daxfs then takes action and clears the poison, confirmed by
> "ndctl list -NM".
>
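> For reference, a minimal sketch of the consume-then-clear sequence (this
> is not the actual fsdax_poison_v1 test; the file path, the offset, and
> the hole-punch-plus-rewrite used to trigger the clear are illustrative
> assumptions):
>
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <setjmp.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> static sigjmp_buf env;
>
> /* The kernel sends SIGBUS (BUS_MCEERR_AR) to the thread that
>  * synchronously consumed the poison. */
> static void sigbus_handler(int sig, siginfo_t *si, void *uc)
> {
>         fprintf(stderr, "SIGBUS code=%d addr=%p\n", si->si_code, si->si_addr);
>         siglongjmp(env, 1);
> }
>
> int main(void)
> {
>         const char *path = "/mnt/dax/testfile";   /* assumption */
>         const off_t off = 0;                      /* assumption */
>         long pgsz = sysconf(_SC_PAGESIZE);
>         struct sigaction sa = { .sa_sigaction = sigbus_handler,
>                                 .sa_flags = SA_SIGINFO };
>         char *map, *zero;
>         int fd;
>
>         sigaction(SIGBUS, &sa, NULL);
>
>         fd = open(path, O_RDWR);
>         if (fd < 0) { perror("open"); return 1; }
>
>         map = mmap(NULL, pgsz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
>         if (map == MAP_FAILED) { perror("mmap"); return 1; }
>
>         if (sigsetjmp(env, 1) == 0) {
>                 /* the load that consumes the injected poison */
>                 volatile char c = map[0x200];
>                 (void)c;
>         } else {
>                 /* one way (as I understand it) to get the filesystem to
>                  * rewrite the affected blocks so the pmem driver can
>                  * clear the poison: punch a hole, then write the page
>                  * back through the fd */
>                 if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                               off, pgsz) < 0)
>                         perror("fallocate");
>                 zero = calloc(1, pgsz);
>                 if (pwrite(fd, zero, pgsz, off) != pgsz)
>                         perror("pwrite");
>                 free(zero);
>         }
>
>         munmap(map, pgsz);
>         close(fd);
>         return 0;
> }
>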
> Now, depending on luck, after some time (from a few seconds to 5+
> hours) the ghost of the previous poison surfaces, and it takes
> unloading and reloading the libnvdimm drivers to drive the phantom
> poison away, confirmed by ARS.
>
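> (For reference, the ARS confirmation I mean is along the lines of
> re-running a scrub and re-reading the media error list, roughly:
>
> # ndctl start-scrub && ndctl wait-scrub
> # ndctl list -NMu -r 0
>
> with the exact bus/region arguments depending on the setup.)
>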
> Turns out, the issue is quite reproducible with the latest stable Linux.
>
> Here are the relevant console messages after injecting 8 poisons in one
> page via
> # ndctl inject-error namespace0.0 -n 2 -B 8210

There is a cut-and-paste error: the above line should be
"# ndctl inject-error namespace0.0 -n 8 -B 8210"

-jane

> then cleared them all and waited for 5+ hours; note the time stamps.
> BTW, the system is otherwise idle.
>
> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> 1850602400
> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> fsdax_poison_v1:8457 due to hardware memory corruption
> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> Recovered
> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> -1850603000 uncached-minus<->write-back
> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> [mem 0x1850602000-0x1850602fff]
>
> At this point,
> # ndctl list -NMu -r 0
> {
>   "dev":"namespace0.0",
>   "mode":"fsdax",
>   "map":"dev",
>   "size":"15.75 GiB (16.91 GB)",
>   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>   "sector_size":4096,
>   "align":2097152,
>   "blockdev":"pmem0"
> }
>
> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [21352.001528] {2}[Hardware Error]: event severity: recoverable
> [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> [21352.014156] {2}[Hardware Error]: section_type: memory error
> [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
> [21352.027958] {2}[Hardware Error]: physical_address_mask:
> 0xffffffffffffff00
> [21352.035827] {2}[Hardware Error]: node: 0 module: 1
> [21352.041466] {2}[Hardware Error]: DIMM location: /SYS/MB/P0 D6
> [21352.048277] Memory failure: 0x1850603: recovery action for dax page:
> Recovered
> [21352.056346] mce: [Hardware Error]: Machine check events logged
> [21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
> [21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255:
> 0xbc0000000000009f
> [21352.056894] EDAC skx MC0: TSC 0x0
> [21352.056895] EDAC skx MC0: ADDR 0x1850603200
> [21352.056897] EDAC skx MC0: MISC 0x8c
> [21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET
> 0 APIC 0x0
> [21352.056909] EDAC MC0: 1 UE memory read error on
> CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603
> offset:0x200 grain:32 - err_code:0x0000:0x009f [..]
>
> And now,
>
> # ndctl list -NMu -r 0
> {
>   "dev":"namespace0.0",
>   "mode":"fsdax",
>   "map":"dev",
>   "size":"15.75 GiB (16.91 GB)",
>   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>   "sector_size":4096,
>   "align":2097152,
>   "blockdev":"pmem0",
>   "badblock_count":1,
>   "badblocks":[
>     {
>       "offset":8217,
>       "length":1,
>       "dimms":[
>         "nmem0"
>       ]
>     }
>   ]
> }
>
> According to my limited research, the phantom poison appears when
> ghes_proc_in_irq() fires to report a delayed UE: it calls
> memory_failure() to take the page out, which in turn causes the driver
> to record a badblock.
>
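> For reference, my (possibly wrong) reading of drivers/acpi/apei/ghes.c
> is that the chain rooted at ghes_proc_in_irq() boils down to roughly
> this, with nothing telling it that the poison may have been cleared in
> the meantime (heavily paraphrased, not the literal code):
>
>     /* irq_work callback draining the queued GHES error records */
>     static void ghes_proc_in_irq(struct irq_work *irq_work)
>     {
>             ...
>             /* for each CPER memory error section, schedule recovery
>              * for the reported physical address */
>             memory_failure_queue(PHYS_PFN(mem_err->physical_addr), flags);
>             ...
>     }
>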
> Note, it is 1 phantom poison for the 8 injected poisons, so the phantom
> is not even an accurate representation of what was injected.
>
> But that aside, it seems that the GHES mechanism and the synchronous MCE
> handling are totally at odds with each other, and that cannot be correct.
>
> What is the right thing to do to fix the issue? Should the
> memory_failure() handler second-guess the GHES report? Should the
> synchronous MCE handling mechanism find a way to tell the firmware that
> a given memory UE has been cleared, and hence clear the record in
> firmware? Other ideas?
>
>
> Thanks!
> -jane