Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages

From: Pavel Begunkov
Date: Mon Jun 23 2025 - 14:01:09 EST


On 6/23/25 18:36, David Hildenbrand wrote:
On 23.06.25 18:59, David Hildenbrand wrote:
On 23.06.25 18:48, Pavel Begunkov wrote:
On 6/23/25 16:11, David Hildenbrand wrote:
...>>> Yes, it stores the head page even if the range passed to
pin_user_pages() doesn't cover the head page.
  > > It should be converted to unpin_user_folio(), which doesn't seem
to do sanity_check_pinned_pages(). Do you think that'll be enough
(conceptually)? Nobody is actually touching the head page in those
cases apart from the final unpin, and storing the head page is
more convenient than keeping folios. I'll take a look if it can
be fully converted to folios w/o extra overhead.

Assuming we had from GUP

nr_pages = 2
pages[0] = folio_page(folio, 1)
pages[1] = folio_page(folio, 2)

After io_coalesce_buffer() we have

nr_pages = 1
pages[0] = folio_page(folio, 0)


Using unpin_user_folio() in all places where we could see something like
that would be the right thing to do. The sanity checks are not in
unpin_user_folio() for exactly that reason: we don't know which folio
pages we pinned.

Let's do that for starters

But now I wonder where you make sure that "Nobody is actually touching
the head page"?

How do you get back the "which folio range" information after
io_coalesce_buffer() ?


If you rely on alignment in virtual address space for you, combined with
imu->folio_shift, that might not work reliably ...

FWIW, applying the following on top of origin/master:

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index dbbcc5eb3dce5..e62a284dcf906 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
                        log_test_result(KSFT_FAIL);
                        goto munmap;
                }
+               mem = mremap_mem;
                size = mremap_size;
                break;
        case THP_RUN_PARTIAL_SHARED:


and then running the selftest, something is not happy:

...
# [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB)
[34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI
[34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1
[34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023
[34272.021984] RIP: 0010:memcpy+0xc/0x20
[34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
[34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206
[34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff
[34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000
[34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000
[34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff
[34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000
[34272.021998] FS:  00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000
[34272.021999] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0
[34272.022001] PKRU: 55555554
[34272.022002] Call Trace:
[34272.022004]  <TASK>
[34272.022005]  copy_page_from_iter_atomic+0x36f/0x7e0
[34272.022009]  ? simple_xattr_get+0x59/0xa0
[34272.022012]  generic_perform_write+0x86/0x2e0
[34272.022016]  shmem_file_write_iter+0x86/0x90
[34272.022019]  io_write+0xe4/0x390
[34272.022023]  io_issue_sqe+0x65/0x4f0
[34272.022024]  ? lock_timer_base+0x7d/0xc0
[34272.022027]  io_wq_submit_work+0xb8/0x320
[34272.022029]  io_worker_handle_work+0xd5/0x300
[34272.022032]  io_wq_worker+0xda/0x300
[34272.022034]  ? finish_task_switch.isra.0+0x99/0x2c0
[34272.022037]  ? __pfx_io_wq_worker+0x10/0x10
[34272.022039]  ret_from_fork+0x34/0x50
[34272.022042]  ? __pfx_io_wq_worker+0x10/0x10
[34272.022044]  ret_from_fork_asm+0x1a/0x30
[34272.022047]  </TASK>


There, we essentially mremap a THP to not be aligned in VA space, and then register half the
THP as a fixed buffer.

So ... my suspicion that this is all rather broken grows :)

It's supposed to calculate the offset from a user pointer and
then work with that, but I guess there are masking that violate
it, I'll check.

--
Pavel Begunkov