Re: Phantom PMEM poison issue

From: Jane Chu
Date: Fri Jan 21 2022 - 21:29:53 EST


On 1/21/2022 5:27 PM, Luck, Tony wrote:
> On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On baremetal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user thread,
>>> and then daxfs takes action and clears the poison, confirmed by "ndctl
>>> -NM".
>>>
>>> Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
>>> the ghost of the previous poison will surface, and it takes
>>> unload/reload the libnvdimm drivers in order to drive the phantom poison
>>> away, confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injected 8 poisons in one
>>> page via
>>> # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
>
> You say "in one page" here. What is the page size?

The page size is 4K, the size of base page on x86.
I said "one page", as 8 (poisons) * 256B = 2KiB, only half page.

>>
>> -jane
>>
>>> then, cleared them all, and wait for 5+ hours, notice the time stamp.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
>>> 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
>>> fsdax_poison_v1:8457 due to hardware memory corruption
>>> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
>>> Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back
>>> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
>>> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
>>> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
>>> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
>>> [mem 0x1850602000-0x1850602fff]
>
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)

yes.
>
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>> "dev":"namespace0.0",
>>> "mode":"fsdax",
>>> "map":"dev",
>>> "size":"15.75 GiB (16.91 GB)",
>>> "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>> "sector_size":4096,
>>> "align":2097152,
>>> "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
>>> Hardware Error Source: 1
>>> [21352.001528] {2}[Hardware Error]: event severity: recoverable
>>> [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
>>> [21352.014156] {2}[Hardware Error]: section_type: memory error
>>> [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
>
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
>

I see, this is the next page... the issue is reproducible with
a single poison injection.

> Is that what you mean by "phantom error" ... from a different
> address from those that were injected?

All 8 poisons were cleared by the driver via DSM, and verified
by "ndctl -NMu -r 0", that covers every page in the 16GiB /dev/pmem.

It's phantom because unload->reload libnvdimm, followed by a full ARS
scan confirms the poison isn't there, hence phantom.

thanks,
-jane

>
> -Tony