Re: Phantom PMEM poison issue

From: Luck, Tony
Date: Fri Jan 21 2022 - 20:27:09 EST


On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> >
> > Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> > the ghost of the previous poison will surface, and it takes
> > unload/reload the libnvdimm drivers in order to drive the phantom poison
> > away, confirmed by ARS.
> >
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> >
> > Here is the relevant console message after injected 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
>
> There is a cut-n-paste error, the above line should be
> "# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size?
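
The reason for asking can be sketched as below. This is only arithmetic under assumptions that are not confirmed in this thread: that the -B and -n arguments count 512-byte blocks, that pages are 4 KiB, and that the namespace data area starts on a page boundary. The helper name pages_touched is made up for illustration.

```python
BLOCK_SIZE = 512   # assumption: inject-error blocks are 512 bytes
PAGE_SIZE = 4096   # assumption: 4 KiB pages


def pages_touched(first_block, nblocks,
                  block_size=BLOCK_SIZE, page_size=PAGE_SIZE):
    """Map each injected block to (page index, offset in page).

    Page indexes are relative to the start of the namespace data area,
    which is assumed to be page aligned.
    """
    spans = []
    for blk in range(first_block, first_block + nblocks):
        byte = blk * block_size
        spans.append((byte // page_size, byte % page_size))
    return spans


# Hypothetical replay of "-n 8 -B 8210" from the corrected command line.
spans = pages_touched(8210, 8)
pages = sorted({page for page, _ in spans})

for page, off in spans:
    print(f"page +{page} offset {off:#x}")
print(f"distinct pages: {len(pages)}")
```

If those assumptions hold, the eight blocks start 0x400 bytes into a page and would spill into the next one, so "one page" only works with a page size larger than 4 KiB.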
>
> -jane
>
> > then, cleared them all, and wait for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> >
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> > 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> > fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> > Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> > 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> > [mem 0x1850602000-0x1850602fff]

This error is reported in PFN 0x1850602 (at offset 0x400 = 1 KiB into the page)

> >
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> > "dev":"namespace0.0",
> > "mode":"fsdax",
> > "map":"dev",
> > "size":"15.75 GiB (16.91 GB)",
> > "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> > "sector_size":4096,
> > "align":2097152,
> > "blockdev":"pmem0"
> > }
> >
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]: section_type: memory error
> > [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200

This error is in the following page: PFN 0x1850603 (at offset 0x200 = 512 bytes)
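
For reference, a minimal sketch of how the two reported physical addresses split into PFN and offset, assuming 4 KiB pages (PAGE_SHIFT = 12). The helper name split_phys is hypothetical; the two addresses are the ones from the log lines quoted above.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def split_phys(addr, page_shift=PAGE_SHIFT):
    """Return (pfn, offset_in_page) for a physical address."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


first = split_phys(0x1850602400)   # address from the mce line
second = split_phys(0x1850603200)  # address from the APEI record

for pfn, off in (first, second):
    print(f"pfn={pfn:#x} offset={off:#x} ({off} bytes)")
```

With that page size the two errors land in adjacent page frames, not in the same one.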

Is that what you mean by "phantom error": an error reported at a
different address from those that were injected?

-Tony