Re: [syzbot] [mm?] kernel BUG in try_to_unmap_one (2)

From: Jinjiang Tu
Date: Fri Jun 06 2025 - 21:29:55 EST


On 2025/6/6 15:56, David Hildenbrand wrote:

On 05.06.25 09:18, Jinjiang Tu wrote:

On 2025/6/5 14:37, David Hildenbrand wrote:

On 05.06.25 08:27, David Hildenbrand wrote:

On 05.06.25 08:11, David Hildenbrand wrote:

On 05.06.25 07:38, syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit:    d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci

Hmmm, another very odd page-table mapping related problem on that tree
found on arm64 only:

In this particular reproducer we seem to be having MADV_HUGEPAGE and
io_uring_setup() be racing with MADV_HWPOISON, MADV_PAGEOUT and
io_uring_register(IORING_REGISTER_BUFFERS).

I assume the issue is related to MADV_HWPOISON, MADV_PAGEOUT and
io_uring_register racing, only. I suspect MADV_HWPOISON is trying to
split a THP, while MADV_PAGEOUT tries paging it out.

IORING_REGISTER_BUFFERS ends up in
io_sqe_buffers_register->io_sqe_buffer_register where we GUP-fast and
try coalescing buffers.

And something about THPs is not particularly happy :)

Not sure if related to io_uring.

unmap_poisoned_folio() calls try_to_unmap() without TTU_SPLIT_HUGE_PMD.

When called from memory_failure(), we make sure to never call it on a
large folio: WARN_ON(folio_test_large(folio));

However, from shrink_folio_list() we might call unmap_poisoned_folio()
on a large folio, which doesn't work if it is still PMD-mapped. Maybe
passing TTU_SPLIT_HUGE_PMD would fix it.

TTU_SPLIT_HUGE_PMD only converts the PMD-mapped THP to a PTE-mapped THP, and may trigger the WARN_ON_ONCE below in try_to_unmap_one().

    if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
        ...
    } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
        !userfaultfd_armed(vma)) {
         ....
    } else if (folio_test_anon(folio)) {
        swp_entry_t entry = page_swap_entry(subpage);
        pte_t swp_pte;
        /*
         * Store the swap location in the pte.
         * See handle_pte_fault() ...
        */
        if (unlikely(folio_test_swapbacked(folio) !=
            folio_test_swapcache(folio))) {
            WARN_ON_ONCE(1);          // here, if the subpage isn't hwpoisoned and we haven't called add_to_swap() for the THP
            goto walk_abort;
         }
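
To make the failing condition concrete, here is a rough user-space model of the branch quoted above. It is only a sketch and not kernel code: struct folio_model, model_unmap_anon_subpage() and the UNMAP_* values are hypothetical names made up for illustration, and the pte_unused() branch is left out. It shows that a swap-backed anon THP that never went through add_to_swap() is aborted for any subpage that is not hwpoisoned.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the folio flags the quoted check looks at. */
struct folio_model {
	bool anon;       /* models folio_test_anon() */
	bool swapbacked; /* models folio_test_swapbacked() */
	bool swapcache;  /* models folio_test_swapcache() */
};

enum unmap_result { UNMAP_OK, UNMAP_ABORT };

/*
 * Simplified model of the decision for one subpage:
 * - a hwpoisoned subpage unmapped with TTU_HWPOISON needs no swap entry;
 * - any other anon subpage needs the folio to be in the swap cache,
 *   otherwise swapbacked != swapcache and the walk is aborted.
 */
static enum unmap_result
model_unmap_anon_subpage(const struct folio_model *folio,
			 bool subpage_hwpoison, bool ttu_hwpoison)
{
	if (subpage_hwpoison && ttu_hwpoison)
		return UNMAP_OK;

	if (folio->anon && folio->swapbacked != folio->swapcache)
		return UNMAP_ABORT; /* the WARN_ON_ONCE(1) path */

	return UNMAP_OK;
}
```

With swapbacked set and swapcache clear, only the poisoned subpage passes; every other subpage of the same THP takes the abort path.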

This makes me wonder if we should start splitting up try_to_unmap(), to handle the individual cases more cleanly at some point ...

Maybe for now something like:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b91a33fb6c694..995486a3ff4d2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1566,6 +1566,14 @@ int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill)
        enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
        struct address_space *mapping;

+       /*
+        * try_to_unmap() cannot deal with some subpages of an anon folio
+        * not being hwpoisoned: we cannot unmap them without swap.
+        */
+       if (folio_test_large(folio) && !folio_test_hugetlb(folio) &&
+           folio_test_anon(folio) && !folio_test_swapcache(folio))
+               return -EBUSY;
+

If the THP is in the swapcache, we also have to split it from PMD-mapped to PTE-mapped first.

    if (folio_test_swapcache(folio)) {
        pr_err("%#lx: keeping poisoned page in swap cache\n", pfn);
        ttu &= ~TTU_HWPOISON;

If we want to unmap in shrink_folio_list(), we have to call try_to_split_thp_page() as memory_failure() does. But that is too complicated; maybe just skipping the
hwpoisoned folio is enough? If the folio is accessed again, memory_failure() will be triggered again and kill the accessing process, since the folio
has been hwpoisoned.

Maybe we should try splitting in there? But staring at shrink_folio_list(), not that easy.

We could return -E2BIG and let the caller try splitting, to then retry.
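
A rough user-space sketch of that split-and-retry idea, under loud assumptions: struct folio_model and all model_*() functions are hypothetical names, and the real code would have to deal with locking, refcounts and the reclaim list, which this ignores. The refusal condition mirrors the check in the diff above, returning -E2BIG instead of -EBUSY.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the folio state relevant to the check. */
struct folio_model {
	bool large;     /* models folio_test_large() */
	bool hugetlb;   /* models folio_test_hugetlb() */
	bool anon;      /* models folio_test_anon() */
	bool swapcache; /* models folio_test_swapcache() */
	bool can_split; /* assumption: whether a split attempt would succeed */
};

/* Refuse large anon folios without swap, telling the caller to split. */
static int model_unmap_poisoned_folio(const struct folio_model *folio)
{
	if (folio->large && !folio->hugetlb &&
	    folio->anon && !folio->swapcache)
		return -E2BIG;
	return 0;
}

/* Models a split attempt; on success the folio is no longer large. */
static bool model_split_folio(struct folio_model *folio)
{
	if (!folio->can_split)
		return false;
	folio->large = false;
	return true;
}

/* Caller side: on -E2BIG, try to split once and retry the unmap. */
static int model_reclaim_poisoned_folio(struct folio_model *folio)
{
	int ret = model_unmap_poisoned_folio(folio);

	if (ret == -E2BIG && model_split_folio(folio))
		ret = model_unmap_poisoned_folio(folio);
	return ret;
}
```

If the split fails, the caller still sees -E2BIG and can fall back to simply skipping the folio.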

Since UCEs are rare in the real world, and a UCE racing with another subsystem is rarer still, spending a lot of effort on handling UCEs in other subsystems is
complicated and not worthwhile. Just skipping is enough; memory_failure() will handle it if the UCE is triggered again.

CC Miaohe Lin
