Re: [PATCH] mm/hwpoison: fix race between soft_offline_page and unpoison_memory

From: Wanpeng Li
Date: Fri Aug 14 2015 - 03:59:32 EST


On 8/14/15 3:54 PM, Wanpeng Li wrote:
[...]
OK, then I'll rethink how to handle the race in unpoison_memory().

Currently, properly contained/hwpoisoned pages should have a page refcount
of 1 (when the memory error hits LRU pages or hugetlb pages) or 0 (when the
memory error hits a buddy page). The current unpoison_memory() implicitly
assumes this, because otherwise the unpoisoned page has no place to go and
is simply leaked.
So to avoid the kernel panic, adding prechecks of refcount and mapcount
to limit unpoisoning to only unpoisonable pages looks OK to me.
A page under soft offlining always has refcount >= 2 and/or mapcount > 0,
so such pages should be filtered out.
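
To make the proposed precheck concrete, here is a minimal userspace sketch
of the filtering rule described above. It is not the patch itself: struct
model_page, model_can_unpoison() and model_precheck_examples() are
hypothetical names, and the plain integer fields only stand in for what the
kernel reads through page_count(), page_mapcount() and PageHWPoison().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a page; not the kernel's struct page. */
struct model_page {
	int refcount;   /* stands in for page_count() */
	int mapcount;   /* stands in for page_mapcount() */
	bool hwpoison;  /* stands in for PageHWPoison() */
};

/*
 * Return true only for a page that looks properly contained:
 * poisoned, unmapped, and with a refcount of at most 1
 * (1 for LRU/hugetlb pages, 0 for a buddy page).
 */
bool model_can_unpoison(const struct model_page *p)
{
	if (!p->hwpoison)
		return false;   /* nothing to unpoison */
	if (p->mapcount > 0)
		return false;   /* still mapped, e.g. under soft offline */
	if (p->refcount > 1)
		return false;   /* someone else still holds a reference */
	return true;
}

/* Usage: the cases named in the text above. */
void model_precheck_examples(void)
{
	struct model_page lru   = { .refcount = 1, .mapcount = 0, .hwpoison = true };
	struct model_page buddy = { .refcount = 0, .mapcount = 0, .hwpoison = true };
	struct model_page soft  = { .refcount = 2, .mapcount = 1, .hwpoison = true };

	assert(model_can_unpoison(&lru));    /* contained LRU/hugetlb page */
	assert(model_can_unpoison(&buddy));  /* contained buddy page */
	assert(!model_can_unpoison(&soft));  /* page under soft offlining */
}
```

The sketch only checks a snapshot of the counters. In the kernel the values
can change between the check and the unpoison itself, so a real precheck
would narrow the race window rather than close it.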

Here's a patch. In my testing (running soft offline stress testing while
repeatedly unpoisoning in the background), the reported (or similar) bug
doesn't happen. Can I have your comments?

As page_action() prints out, the page may still be referenced by some users
even though PageHWPoison has already been set. So you will leak many
poisoned pages.


Anyway, the bug is still there.

[ 944.387559] BUG: Bad page state in process expr pfn:591e3
[ 944.393053] page:ffffea00016478c0 count:-1 mapcount:0 mapping: (null) index:0x2
[ 944.401147] flags: 0x1fffff80000000()
[ 944.404819] page dumped because: nonzero _count
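
One way to read the log above: a free page is expected to have a count of
exactly 0, so a single unbalanced reference drop on such a page leaves -1
and trips the "nonzero _count" check. The sketch below models only that
arithmetic in userspace. struct toy_page, toy_put(), toy_bad_reason() and
toy_underflow_example() are hypothetical names, and which kernel path
actually drops the extra reference is not established in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; only the counter matters here. */
struct toy_page {
	int count;  /* models the value printed in the "count:" field */
};

/* Drop one reference with no underflow protection. */
void toy_put(struct toy_page *p)
{
	p->count--;
}

/*
 * Model of the sanity check on a free page: any count other than 0
 * is reported. Returns the reason string, or NULL if the page is fine.
 */
const char *toy_bad_reason(const struct toy_page *p)
{
	return p->count != 0 ? "nonzero _count" : NULL;
}

/* Usage: an already-free page receives one extra put. */
void toy_underflow_example(void)
{
	struct toy_page page = { .count = 0 };  /* free (buddy) page */

	assert(toy_bad_reason(&page) == NULL);  /* consistent so far */
	toy_put(&page);                         /* unbalanced reference drop */
	assert(page.count == -1);               /* matches "count:-1" */
	assert(toy_bad_reason(&page) != NULL);  /* check now fires */
}
```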

Regards,
Wanpeng Li