Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages

From: David Hildenbrand
Date: Thu May 08 2025 - 11:37:43 EST


On 08.05.25 17:23, Pantelis Antoniou wrote:
On Thu, 8 May 2025 17:03:46 +0200
David Hildenbrand <david@xxxxxxxxxx> wrote:

Hi there,

On 07.05.25 17:41, Pantelis Antoniou wrote:

Hi,

Recent updates to net filesystems enabled zero copy operations,
which require getting a user space page pinned.

This does not work for pages that were allocated via
__get_user_pages and then mapped to user-space via remap_pfn_range.

Right. Because the struct page of a VM_PFNMAP *must not be touched*.
It has to be treated like it doesn't exist.


Well, that's not exactly the case. For pages mapped to user space via
remap_pfn_range(), the VM_PFNMAP bit is set even though the pages do
have a struct page.

Yes. And VM_PFNMAP is the big flag that says these pages shall not be touched. Even if the struct page exists. Even if you say please. :)

See the comment above vm_normal_page():

"Special" mappings do not wish to be associated with a "struct
page" (either it doesn't exist, or it exists but they don't want
to touch it)
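
For illustration, a condensed sketch of that policy (paraphrased; the real
vm_normal_page() in mm/memory.c also handles pte_special(), CoW'ed PFN
mappings and the zeropage):

struct page *vm_normal_page_sketch(struct vm_area_struct *vma,
				   unsigned long addr, pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	if (vma->vm_flags & VM_MIXEDMAP) {
		/* Mixed mapping: pfns with a memmap entry are normal. */
		if (!pfn_valid(pfn))
			return NULL;
	} else if (vma->vm_flags & VM_PFNMAP) {
		/*
		 * Pure PFN mapping: never hand out a struct page, even
		 * if one happens to exist for this pfn.
		 */
		return NULL;
	}
	return pfn_to_page(pfn);
}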

VM_MIXEDMAP could maybe be used for that purpose; possibly GUP would also have to be updated to make use of it. (I was hoping we could get rid of VM_MIXEDMAP at some point.)
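
For example, a driver that wants core-MM to treat its pages as normal can
map them with vm_insert_page(), which marks the VMA VM_MIXEDMAP instead of
VM_PFNMAP. A hypothetical mmap handler ("struct my_drv_buf" and its fields
are made up, this is not DRM code):

static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct my_drv_buf *buf = file->private_data;
	unsigned long addr = vma->vm_start;
	unsigned long i, npages = vma_pages(vma);
	int ret;

	if (npages > buf->npages)
		return -EINVAL;

	/*
	 * vm_insert_page() only accepts page-allocator pages and sets
	 * VM_MIXEDMAP on the VMA, keeping the pages visible to GUP.
	 */
	for (i = 0; i < npages; i++, addr += PAGE_SIZE) {
		ret = vm_insert_page(vma, addr, buf->pages[i]);
		if (ret)
			return ret;
	}
	return 0;
}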



The details of how it happens are at the cover page of this patch but
let me paste the relevant bits here.

"In our emulation environment we have noticed failing writes when
performing I/O from a userspace mapped DRM GEM buffer object.
The platform does not use VRAM, all graphics memory is regular DRAM
memory, allocated via __get_free_pages

The same write was successful from a heap allocated bounce buffer.

The sequence of events is as follows.

1. A BO (Buffer Object) is created, and its backing memory is allocated via
__get_user_pages()

__get_user_pages() only allocates memory via page faults. Are you sure you meant __get_user_pages() here?


2. Userspace mmaps the BO via an mmap() call on the opened file handle of
the DRM driver. The mapping is done via the drm_gem_mmap_obj() call.

3. Userspace issues a write to a file copying the contents of the BO.

3a. If the file is located on a regular filesystem (like ext4), the write
completes successfully.

3b. If the file is located on a network filesystem, like 9p, the write fails.

The write fails because v9fs_file_write_iter() calls
netfs_unbuffered_write_iter() and netfs_unbuffered_write_iter_locked(),
which calls netfs_extract_user_iter().

netfs_extract_user_iter() will in turn call iov_iter_extract_pages(), which
for a user-backed iterator calls iov_iter_extract_user_pages(), which calls
pin_user_pages_fast(), which finally calls __gup_longterm_locked().

__gup_longterm_locked() calls __get_user_pages_locked(), which fails
because the VMA is marked with the VM_IO and VM_PFNMAP flags."
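
For reference, a hypothetical userspace reproducer along the lines of the
steps above (paths, sizes and error handling are placeholders; header
locations may vary with libdrm):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <drm/drm.h>
#include <drm/drm_mode.h>

int main(void)
{
	int card = open("/dev/dri/card0", O_RDWR);

	/* Step 1: create a dumb BO; the driver allocates the backing pages. */
	struct drm_mode_create_dumb create = {
		.width = 1024, .height = 768, .bpp = 32,
	};
	ioctl(card, DRM_IOCTL_MODE_CREATE_DUMB, &create);

	struct drm_mode_map_dumb map = { .handle = create.handle };
	ioctl(card, DRM_IOCTL_MODE_MAP_DUMB, &map);

	/* Step 2: mmap the BO; the driver sets VM_IO | VM_PFNMAP on the VMA. */
	void *bo = mmap(NULL, create.size, PROT_READ | PROT_WRITE,
			MAP_SHARED, card, map.offset);

	/* Step 3: copy the BO to a file; succeeds on ext4 (3a), fails on a
	 * 9p mount (3b) because the zero-copy path cannot pin the pages. */
	int out = open("/mnt/9p/dump.bin", O_WRONLY | O_CREAT, 0644);
	ssize_t n = write(out, bo, create.size);

	return n == (ssize_t)create.size ? 0 : 1;
}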

So, drm_gem_mmap_obj() ends up using remap_pfn_range()?

I can spot that drm_gem_mmap_obj() has a path where it explicitly sets

vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);

Which is a clear sign to core-MM (incl. GUP) to never mess with the mapped pages.
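
That flag check is what ultimately makes GUP fail here; condensed from
check_vma_flags() in mm/gup.c (paraphrased, not the exact upstream code):

static int check_vma_flags_sketch(struct vm_area_struct *vma)
{
	/*
	 * GUP refuses IO/PFN mappings outright; callers see -EFAULT,
	 * which is the failure observed in step 3b above.
	 */
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;
	return 0;
}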



remap_pfn_range_internal() will turn on the VM_IO | VM_PFNMAP vma bits.
VM_PFNMAP in particular marks the pages as not having a struct page
associated with them, which is not the case for __get_user_pages().

This in turn makes any attempt to lock a page fail, breaking
I/O from that address range.

This patch addresses it by special-casing pages in those VMAs and not
calling vm_normal_page() for them.

Signed-off-by: Pantelis Antoniou <p.antoniou@xxxxxxxxxxxxxxxxxxx>
---
mm/gup.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 84461d384ae2..e185c18c0c81 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t pte, struct page *page,
 	return !userfaultfd_pte_wp(vma, pte);
 }
 
+static struct page *gup_normal_page(struct vm_area_struct *vma,
+				    unsigned long address, pte_t pte)
+{
+	unsigned long pfn;
+
+	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
+		pfn = pte_pfn(pte);
+		if (!pfn_valid(pfn) || is_zero_pfn(pfn) ||
+		    pfn > highest_memmap_pfn)
+			return NULL;
+		return pfn_to_page(pfn);
+	}
+	return vm_normal_page(vma, address, pte);
+}
I enjoy seeing vm_normal_page() checks in GUP code.

I don't enjoy seeing what you added before that :)

If vm_normal_page() tells you "this is not a normal page", then we should
not touch it. There is one exception: the shared zeropage.
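
Roughly, paraphrasing follow_page_pte() in mm/gup.c (a condensed sketch,
not the exact upstream code):

static struct page *follow_normal_or_zeropage(struct vm_area_struct *vma,
					      unsigned long address, pte_t pte)
{
	struct page *page = vm_normal_page(vma, address, pte);

	if (page)
		return page;
	/* The one exception: the shared zeropage may be observed. */
	if (is_zero_pfn(pte_pfn(pte)))
		return pte_page(pte);
	/* Any other "special" pte makes GUP bail out. */
	return ERR_PTR(-EFAULT);
}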


So, unfortunately, this is wrong.


Well, let's talk about a proper fix, then, for the previously mentioned
user-space regression.

You really have to find out the responsible commit. GUP has been behaving like that forever, I'm afraid.

And VM_PFNMAP was already set in drm_gem_mmap_obj() at least as far back as 2012, if I am not wrong.

--
Cheers,

David / dhildenb