[RFC PATCH v4 4/8] hugetlbfs: catch and handle truncate racing with page faults

From: Mike Kravetz
Date: Wed Jul 06 2022 - 16:24:46 EST


Most hugetlb fault handling code checks for faults beyond i_size.
While there are early checks in the code paths, the most difficult
to handle are those discovered after taking the page table lock.
At this point, we may have already allocated a page and consumed the
associated reservation, and may have added the page to the page cache.
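For context, the fault path in hugetlb_no_page() follows roughly this
pattern; a condensed sketch with error handling and the shared/private
distinctions omitted, not the exact upstream code:

	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size)			/* early check */
		goto out;

	page = alloc_huge_page(vma, haddr, 0);	/* may consume a reservation */
	...
	hugetlb_add_to_page_cache(page, mapping, idx);
	...
	ptl = huge_pte_lock(h, mm, ptep);
	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size)			/* truncate raced with us */
		goto backout;			/* must undo all of the above */

The window between the allocation/page cache add and the recheck under
the page table lock is where truncate can slip in.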

When discovering a fault beyond i_size, be sure to:
- Remove the page from the page cache; otherwise it will sit there
  until the file is removed.
- Do not restore any reservation for the consumed page; otherwise
  there will be an outstanding reservation for an offset beyond the
  end of the file.
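Condensed, the backout path in hugetlb_no_page() then looks like this
(see the mm/hugetlb.c hunk below for the real code):

backout:
	spin_unlock(ptl);
backout_unlocked:
	if (new_page) {
		if (new_pagecache_page)
			hugetlb_delete_from_page_cache(page);
		/* reservation was consumed; flag page so it is restored on free */
		if (reserve_alloc)
			SetHPageRestoreReserve(page);
		/* do not restore reserve map entries beyond i_size */
		if (!beyond_i_size)
			restore_reserve_on_error(h, vma, haddr, page);
	}
	unlock_page(page);
	put_page(page);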

The 'truncation' code in remove_inode_hugepages must deal with fault
code potentially removing a page/folio from the cache after the folio
was returned by filemap_get_folios and before the folio lock is taken.
This can be detected by a change in folio_mapping() after taking the
folio lock. In addition, this code must deal with fault code potentially
consuming and returning reservations. To synchronize this,
remove_inode_hugepages will now take the fault mutex for ALL indices in
the hole or truncated range. In this way, it KNOWS fault code has
finished with the page/index, OR fault code will see the updated file
size.
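The serialization reduces to cycling the fault mutex for each index; a
minimal sketch of the loops added to remove_inode_hugepages():

	/*
	 * Lock/unlock the fault mutex for every index in the range.
	 * Any fault that got in ahead of us has either completed its
	 * backout under the mutex, or will observe the new i_size.
	 */
	for (m_index = m_start; m_index < m_end; m_index++) {
		u32 hash = hugetlb_fault_mutex_hash(mapping, m_index);

		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}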

Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
---
fs/hugetlbfs/inode.c | 88 ++++++++++++++++++++++++++++++--------------
mm/hugetlb.c | 39 +++++++++++++++-----
2 files changed, 90 insertions(+), 37 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a878c672cf6d..31bd4325fce5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -443,11 +443,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts. Page faults can race with truncation.
+ * During faults, hugetlb_no_page() checks i_size before page allocation,
+ * and again after obtaining page table lock. It will 'back out'
+ * allocations in the truncated range.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserve map
@@ -456,27 +455,46 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* This is indicated if we find a mapped page.
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data. To somewhat synchronize these operations
+ * the hugetlb fault mutex is taken for EVERY index in the range to be hole
+ * punched or truncated. In this way, we KNOW fault code will either have
+ * completed backout operations under the mutex, or fault code will see the
+ * updated file size and not allocate a page for offsets beyond truncated size.
+ * The parameter 'lm_end' indicates the offset of the end of hole or file
+ * before truncation. For hole punch lm_end == lend.
*/
static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
- loff_t lend)
+ loff_t lend, loff_t lm_end)
{
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> huge_page_shift(h);
const pgoff_t end = lend >> huge_page_shift(h);
+ pgoff_t m_end = lm_end >> huge_page_shift(h);
+ pgoff_t m_start, m_index;
struct folio_batch fbatch;
pgoff_t next, index;
int i, freed = 0;
+ u32 hash;
bool truncate_op = (lend == LLONG_MAX);

folio_batch_init(&fbatch);
- next = start;
+ next = m_start = start;
while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
for (i = 0; i < folio_batch_count(&fbatch); ++i) {
struct folio *folio = fbatch.folios[i];
- u32 hash = 0;

index = folio->index;
+ /* Take fault mutex for missing folios before index */
+ for (m_index = m_start; m_index < index; m_index++) {
+ hash = hugetlb_fault_mutex_hash(mapping,
+ m_index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ }
+ m_start = index + 1;
hash = hugetlb_fault_mutex_hash(mapping, index);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

@@ -485,13 +503,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the folio.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
*/
if (unlikely(folio_mapped(folio))) {
- BUG_ON(truncate_op);
-
i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
@@ -502,20 +515,30 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,

folio_lock(folio);
/*
- * We must free the huge page and remove from page
- * cache BEFORE removing the region/reserve map
- * (hugetlb_unreserve_pages). In rare out of memory
- * conditions, removal of the region/reserve map could
- * fail. Correspondingly, the subpool and global
- * reserve usage count can need to be adjusted.
+ * After locking page, make sure mapping is the same.
+ * We could have raced with page fault populate and
+ * backout code.
*/
- VM_BUG_ON(HPageRestoreReserve(&folio->page));
- hugetlb_delete_from_page_cache(&folio->page);
- freed++;
- if (!truncate_op) {
- if (unlikely(hugetlb_unreserve_pages(inode,
+ if (folio_mapping(folio) == mapping) {
+ /*
+ * We must free the folio and remove from
+ * page cache BEFORE removing the region/
+ * reserve map (hugetlb_unreserve_pages). In
+ * rare out of memory conditions, removal of
+ * the region/reserve map could fail.
+ * Correspondingly, the subpool and global
+ * reserve usage count can need to be adjusted.
+ */
+ VM_BUG_ON(HPageRestoreReserve(&folio->page));
+ hugetlb_delete_from_page_cache(&folio->page);
+ freed++;
+ if (!truncate_op) {
+ if (unlikely(
+ hugetlb_unreserve_pages(inode,
index, index + 1, 1)))
- hugetlb_fix_reserve_counts(inode);
+ hugetlb_fix_reserve_counts(
+ inode);
+ }
}

folio_unlock(folio);
@@ -525,6 +548,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
cond_resched();
}

+ /* Take fault mutex for missing folios at end of range */
+ for (m_index = m_start; m_index < m_end; m_index++) {
+ hash = hugetlb_fault_mutex_hash(mapping, m_index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ }
+
if (truncate_op)
(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
}
@@ -532,8 +562,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
static void hugetlbfs_evict_inode(struct inode *inode)
{
struct resv_map *resv_map;
+ loff_t prev_size = i_size_read(inode);

- remove_inode_hugepages(inode, 0, LLONG_MAX);
+ remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);

/*
* Get the resv_map from the address space embedded in the inode.
@@ -553,6 +584,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
struct hstate *h = hstate_inode(inode);
+ loff_t prev_size = i_size_read(inode);

BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
@@ -563,7 +595,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
ZAP_FLAG_DROP_MARKER);
i_mmap_unlock_write(mapping);
- remove_inode_hugepages(inode, offset, LLONG_MAX);
+ remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
}

static void hugetlbfs_zero_partial_page(struct hstate *h,
@@ -635,7 +667,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)

/* Remove full pages from the file. */
if (hole_end > hole_start)
- remove_inode_hugepages(inode, hole_start, hole_end);
+ remove_inode_hugepages(inode, hole_start, hole_end, hole_end);

inode_unlock(inode);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9f320c676e4..25f644a3a981 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5491,6 +5491,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ bool beyond_i_size = false;
+ bool reserve_alloc = false;

/*
* Currently, we are forced to kill the process in the event the
@@ -5548,6 +5550,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
clear_huge_page(page, address, pages_per_huge_page(h));
__SetPageUptodate(page);
new_page = true;
+ if (HPageRestoreReserve(page))
+ reserve_alloc = true;

if (vma->vm_flags & VM_MAYSHARE) {
int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

ptl = huge_pte_lock(h, mm, ptep);
size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
+ if (idx >= size) {
+ beyond_i_size = true;
goto backout;
+ }

ret = 0;
/* If pte changed from under us, retry */
@@ -5652,10 +5658,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
backout:
spin_unlock(ptl);
backout_unlocked:
+ if (new_page) {
+ if (new_pagecache_page)
+ hugetlb_delete_from_page_cache(page);
+
+ /*
+ * If reserve was consumed, make sure flag is set so that it
+ * will be restored in free_huge_page().
+ */
+ if (reserve_alloc)
+ SetHPageRestoreReserve(page);
+
+ /*
+ * Do not restore reserve map entries beyond i_size.
+ * Otherwise, there will be leaks when the file is removed.
+ */
+ if (!beyond_i_size)
+ restore_reserve_on_error(h, vma, haddr, page);
+ }
unlock_page(page);
- /* restore reserve for newly allocated pages not in page cache */
- if (new_page && !new_pagecache_page)
- restore_reserve_on_error(h, vma, haddr, page);
put_page(page);
goto out;
}
@@ -5975,15 +5996,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* Recheck the i_size after holding PT lock to make sure not
* to leave any page mapped (as page_mapped()) beyond the end
* of the i_size (remove_inode_hugepages() is strict about
- * enforcing that). If we bail out here, we'll also leave a
- * page in the radix tree in the vm_shared case beyond the end
- * of the i_size, but remove_inode_hugepages() will take care
- * of it as soon as we drop the hugetlb_fault_mutex_table.
+ * enforcing that). If we bail out here, remove the page
+ * added to the radix tree.
*/
size = i_size_read(mapping->host) >> huge_page_shift(h);
ret = -EFAULT;
- if (idx >= size)
+ if (idx >= size) {
+ hugetlb_delete_from_page_cache(page);
goto out_release_unlock;
+ }

ret = -EEXIST;
/*
--
2.35.3