RE: Phantom PMEM poison issue

From: Tsaur, Erwin
Date: Fri Jan 21 2022 - 20:52:00 EST


Hi Jane,

Is the phantom error a poison that was injected and then cleared, but somehow shows up again?
How does "daxfs takes action and clears the poison" happen: through mailbox commands or through writes?
Also, how are you running ARS?

Erwin

-----Original Message-----
From: Luck, Tony <tony.luck@xxxxxxxxx>
Sent: Friday, January 21, 2022 5:27 PM
To: chu, jane <jane.chu@xxxxxxxxxx>
Cc: Williams, Dan J <dan.j.williams@xxxxxxxxx>; bp@xxxxxxxxx >> Borislav Petkov <bp@xxxxxxxxx>; djwong@xxxxxxxxxx; willy@xxxxxxxxxxxxx; nvdimm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: Phantom PMEM poison issue

On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user
> > thread, and then daxfs takes action and clears the poison, confirmed
> > by "ndctl -NM".
> >
> > Now, depending on luck, after some time (from a few seconds to 5+
> > hours) the ghost of the previous poison will surface, and it takes
> > unloading and reloading the libnvdimm drivers to drive the phantom
> > poison away, confirmed by ARS.
> >
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> >
> > Here is the relevant console message after injecting 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
>
> There is a cut-n-paste error, the above line should be
> "# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size?
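
One way to check the "in one page" claim is to work out the byte range the injection covers. A minimal sketch, assuming that -B counts 512-byte blocks from a page-aligned namespace start and that pages are 4 KiB (both are assumptions, not confirmed in this thread; pages_touched is a hypothetical helper):

```python
BLOCK_SIZE = 512   # assumption: ndctl -B counts 512-byte blocks
PAGE_SIZE = 4096   # assumption: 4 KiB base pages

def pages_touched(first_block, count, block_size=BLOCK_SIZE, page_size=PAGE_SIZE):
    """Return the page indexes (relative to the namespace start) that the blocks cover."""
    start = first_block * block_size
    end = start + count * block_size - 1   # last byte, inclusive
    return list(range(start // page_size, end // page_size + 1))

if __name__ == "__main__":
    # 8 blocks starting at block 8210, as in the corrected command line
    print(pages_touched(8210, 8))
    # offset of the first injected block within its page
    print(hex(8210 * BLOCK_SIZE % PAGE_SIZE))
```

Under those assumptions the 8 blocks start 0x400 into one page and spill over into the next, so they would not all sit in a single 4 KiB page.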
>
> -jane
>
> > then cleared them all and waited for 5+ hours; notice the time stamp.
> > BTW, the system is idle otherwise.
> >
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]

This error is reported in PFN=0x1850602 (at offset 0x400 = 1 KiB)

> >
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> > "dev":"namespace0.0",
> > "mode":"fsdax",
> > "map":"dev",
> > "size":"15.75 GiB (16.91 GB)",
> > "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> > "sector_size":4096,
> > "align":2097152,
> > "blockdev":"pmem0"
> > }
> >
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]: section_type: memory error
> > [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200

This error is in the following page: PFN=0x1850603 (at offset 0x200 = 512 bytes)
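
For reference, the PFN and offset for the two reported addresses can be derived as follows. This is only a sketch, assuming 4 KiB pages (PAGE_SHIFT = 12); split_phys_addr is a hypothetical helper, not a kernel or ndctl function:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages

def split_phys_addr(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset within the page)."""
    pfn = addr >> page_shift
    offset = addr & ((1 << page_shift) - 1)
    return pfn, offset

if __name__ == "__main__":
    # Addresses taken from the two log excerpts quoted above
    for addr in (0x1850602400, 0x1850603200):
        pfn, off = split_phys_addr(addr)
        print(f"addr={addr:#x} pfn={pfn:#x} offset={off:#x} ({off} bytes)")
```

The two addresses land in adjacent page frames, which is why the second report does not show up under the PFN from the first one.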

Is that what you mean by "phantom error" ... from a different address from those that were injected?

-Tony