Re: Phantom PMEM poison issue

From: Jane Chu
Date: Fri Jan 21 2022 - 21:33:06 EST


On 1/21/2022 5:51 PM, Tsaur, Erwin wrote:
> Hi Jane,
>
> Is a phantom error a poison that was injected and then cleared, but somehow shows up again?
> How does "daxfs takes action and clears the poison" work: through a mailbox command or through writes?
> Also how are you doing ARS?

The phantom shows up as soon as this console message from 'ghes' appears:
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1

The poisons were cleared via pmem_clear_poison().

ARS was run as
"ndctl start-scrub; ndctl wait-scrub -p 30"

thanks,
-jane


>
> Erwin
>
> -----Original Message-----
> From: Luck, Tony <tony.luck@xxxxxxxxx>
> Sent: Friday, January 21, 2022 5:27 PM
> To: chu, jane <jane.chu@xxxxxxxxxx>
> Cc: Williams, Dan J <dan.j.williams@xxxxxxxxx>; Borislav Petkov <bp@xxxxxxxxx>; djwong@xxxxxxxxxx; willy@xxxxxxxxxxxxx; nvdimm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: Phantom PMEM poison issue
>
> On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On baremetal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user
>>> thread, and then daxfs takes action and clears the poison, confirmed
>>> by "ndctl list -NM".
>>>
>>> Now, depending on luck, after some time (from a few seconds to 5+
>>> hours) the ghost of the previous poison will surface, and it takes
>>> unloading and reloading the libnvdimm drivers to drive the phantom
>>> poison away, as confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injecting 8 poisons in one
>>> page via
>>> # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
>
> You say "in one page" here. What is the page size?
>>
>> -jane
>>
>>> then cleared them all and waited for 5+ hours; notice the timestamps.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
>>> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back
>>> [ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
>>> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
>>> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]
>
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)
>
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>> "dev":"namespace0.0",
>>> "mode":"fsdax",
>>> "map":"dev",
>>> "size":"15.75 GiB (16.91 GB)",
>>> "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>> "sector_size":4096,
>>> "align":2097152,
>>> "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
>>> [21352.001528] {2}[Hardware Error]: event severity: recoverable
>>> [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
>>> [21352.014156] {2}[Hardware Error]: section_type: memory error
>>> [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
>
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512 bytes)
>
> Is that what you mean by "phantom error" ... an error at a different address from those that were injected?
>
> -Tony
>
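
The page and offset arithmetic quoted above can be checked with a short sketch. It assumes a 4 KiB page size (consistent with the [mem 0x1850602000-0x1850602fff] range in the log, though the thread does not confirm it) and assumes that the namespace data area starts on a page boundary, so a block's offset within its page is the same in namespace terms and in physical terms. The -B and -n arguments of ndctl inject-error are in 512-byte blocks. The helper names split_addr and injected_offsets are made up for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages, not confirmed in the thread
BLOCK_SIZE = 512   # ndctl inject-error -B/-n count 512-byte blocks


def split_addr(phys_addr, page_size=PAGE_SIZE):
    """Split a physical address into (pfn, offset within the page)."""
    return phys_addr // page_size, phys_addr % page_size


def injected_offsets(first_block, count,
                     block_size=BLOCK_SIZE, page_size=PAGE_SIZE):
    """For each injected block, return (page index, offset within page).

    The page index is relative to the start of the namespace data area,
    which is assumed (hypothetically) to be page-aligned.
    """
    return [divmod(block * block_size, page_size)
            for block in range(first_block, first_block + count)]


if __name__ == "__main__":
    # The two addresses reported in the console log.
    for addr in (0x1850602400, 0x1850603200):
        pfn, off = split_addr(addr)
        print(f"addr={addr:#x} pfn={pfn:#x} offset={off:#x}")

    # The corrected injection: -n 8 -B 8210.
    by_page = {}
    for page, off in injected_offsets(8210, 8):
        by_page.setdefault(page, []).append(off)
    for page, offs in sorted(by_page.items()):
        print(f"page {page:#x}: offsets {[hex(o) for o in offs]}")
```

Under these assumptions, blocks 8210 through 8215 fall at offsets 0x400 through 0xe00 of one page, and blocks 8216 and 8217 fall at offsets 0x000 and 0x200 of the next page. So the 8 injected blocks would span two 4 KiB pages rather than one, and the offsets 0x400 and 0x200 match the two reported addresses. This is suggestive only; it does not by itself show where the later error came from.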