Phantom PMEM poison issue

From: Jane Chu
Date: Fri Jan 21 2022 - 19:31:33 EST


On baremetal Intel platform with DCPMEM installed and configured to
provision daxfs, say a poison was consumed by a load from a user thread,
and then daxfs takes action and clears the poison, confirmed by "ndctl
-NM".

Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
the ghost of the previous poison will surface, and it takes
unload/reload the libnvdimm drivers in order to drive the phantom poison
away, confirmed by ARS.

Turns out, the issue is quite reproducible with the latest stable Linux.

Here is the relevant console message after injected 8 poisons in one
page via
# ndctl inject-error namespace0.0 -n 2 -B 8210
then, cleared them all, and wait for 5+ hours, notice the time stamp.
BTW, the system is idle otherwise.

[ 2439.742296] mce: Uncorrected hardware memory error in user-access at
1850602400
[ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
fsdax_poison_v1:8457 due to hardware memory corruption
[ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
Recovered
[ 2439.769949] mce: [Hardware Error]: Machine check events logged
-1850603000 uncached-minus<->write-back
[ 2439.769984] x86/PAT: memtype_reserve failed [mem
0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
[ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
[ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
[mem 0x1850602000-0x1850602fff]

At this point,
# ndctl list -NMu -r 0
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":"15.75 GiB (16.91 GB)",
"uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
"sector_size":4096,
"align":2097152,
"blockdev":"pmem0"
}

[21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[21352.001528] {2}[Hardware Error]: event severity: recoverable
[21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
[21352.014156] {2}[Hardware Error]: section_type: memory error
[21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
[21352.027958] {2}[Hardware Error]: physical_address_mask:
0xffffffffffffff00
[21352.035827] {2}[Hardware Error]: node: 0 module: 1
[21352.041466] {2}[Hardware Error]: DIMM location: /SYS/MB/P0 D6
[21352.048277] Memory failure: 0x1850603: recovery action for dax page:
Recovered
[21352.056346] mce: [Hardware Error]: Machine check events logged
[21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255:
0xbc0000000000009f
[21352.056894] EDAC skx MC0: TSC 0x0
[21352.056895] EDAC skx MC0: ADDR 0x1850603200
[21352.056897] EDAC skx MC0: MISC 0x8c
[21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET
0 APIC 0x0
[21352.056909] EDAC MC0: 1 UE memory read error on
CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603
offset:0x200 grain:32 - err_code:0x0000:0x009f [..]

And now,

# ndctl list -NMu -r 0
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":"15.75 GiB (16.91 GB)",
"uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
"sector_size":4096,
"align":2097152,
"blockdev":"pmem0",
"badblock_count":1,
"badblocks":[
{
"offset":8217,
"length":1,
"dimms":[
"nmem0"
]
}
]
}

According to my limited research, when ghes_proc_in_irq() is fired to
report a delayed UE and it calls memory_failure() to take the page out
and causes driver to record a badblock record, and that's how the
phantom poison appeared.

Note, 1 phantom poison for 8 injected poisons, so, not an accurate
phantom representation.

But that aside, it seems that the GHES mechanism and the synchronous MCE
handling is totally at odds with each other, and that cannot be correct.

What is the right thing to do to fix the issue? Should memory_failure
handler second-guess the GHES report? Should the synchronous MCE
handling mechanism manage to tell the firmware that so-and-so memory UE
has been cleared and hence clear the record in firmware? Other ideas?


Thanks!
-jane