Re: [PATCH] mm/damon/vaddr: Safely walk page table

From: David Hildenbrand
Date: Tue Aug 31 2021 - 07:46:49 EST


On 31.08.21 12:49, SeongJae Park wrote:
From: SeongJae Park <sjpark@xxxxxxxxx>

On Tue, 31 Aug 2021 11:53:05 +0200 David Hildenbrand <david@xxxxxxxxxx> wrote:

On 27.08.21 17:04, SeongJae Park wrote:
From: SeongJae Park <sjpark@xxxxxxxxx>

Commit d7f647622761 ("mm/damon: implement primitives for the virtual
memory address spaces") of linux-mm[1] tries to find the PTE or PMD for
an arbitrary virtual address using 'follow_invalidate_pte()' without
proper locking[2]. This commit fixes the issue by using another page
table walk function that suits the more general use case, under proper
locking.

[1] https://github.com/hnaz/linux-mm/commit/d7f647622761
[2] https://lore.kernel.org/linux-mm/3b094493-9c1e-6024-bfd5-7eca66399b7e@xxxxxxxxxx

Fixes: d7f647622761 ("mm/damon: implement primitives for the virtual memory address spaces")
Reported-by: David Hildenbrand <david@xxxxxxxxxx>
Signed-off-by: SeongJae Park <sjpark@xxxxxxxxx>
---
mm/damon/vaddr.c | 81 +++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 74 insertions(+), 7 deletions(-)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 230db7413278..b3677f2ef54b 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -8,10 +8,12 @@
 #define pr_fmt(fmt) "damon-va: " fmt
 
 #include <linux/damon.h>
+#include <linux/hugetlb.h>
 #include <linux/mm.h>
 #include <linux/mmu_notifier.h>
 #include <linux/highmem.h>
 #include <linux/page_idle.h>
+#include <linux/pagewalk.h>
 #include <linux/random.h>
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
@@ -446,14 +448,69 @@ static void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 }
 
+struct damon_walk_private {
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+};
+
+static int damon_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+		struct mm_walk *walk)
+{
+	struct damon_walk_private *priv = walk->private;
+
+	if (pmd_huge(*pmd)) {
+		priv->ptl = pmd_lock(walk->mm, pmd);
+		if (pmd_huge(*pmd)) {
+			priv->pmd = pmd;
+			return 0;
+		}
+		spin_unlock(priv->ptl);
+	}
+
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		return -EINVAL;
+	priv->pte = pte_offset_map_lock(walk->mm, pmd, addr, &priv->ptl);
+	if (!pte_present(*priv->pte)) {
+		pte_unmap_unlock(priv->pte, priv->ptl);
+		priv->pte = NULL;
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static struct mm_walk_ops damon_walk_ops = {
+	.pmd_entry = damon_pmd_entry,
+};
+
+int damon_follow_pte_pmd(struct mm_struct *mm, unsigned long addr,
+		struct damon_walk_private *private)
+{
+	int rc;
+
+	private->pte = NULL;
+	private->pmd = NULL;
+	rc = walk_page_range(mm, addr, addr + 1, &damon_walk_ops, private);
+	if (!rc && !private->pte && !private->pmd)
+		return -EINVAL;
+	return rc;
+}
+
 static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = NULL;
-	pmd_t *pmd = NULL;
+	struct damon_walk_private walk_result;
+	pte_t *pte;
+	pmd_t *pmd;
 	spinlock_t *ptl;
 
-	if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl))
+	mmap_write_lock(mm);

Can you elaborate on why mmap_read_lock() isn't sufficient for your use
case? Taking the lock in write mode might heavily affect DAMON's performance
and its impact on the workload.

Because, as you also mentioned in the previous mail, 'we can walk page tables
ignoring VMAs with the mmap semaphore held in write mode', and in this case we
don't know to which VMA the address belongs. I thought the link to the mail
could help people understand the reason. But, as you suggest, I now think
putting an elaborated explanation here would be much better. I will also add
a warning about the possible performance impact.

walk_page_range() makes sure to skip any VMA holes and only walks ranges
within VMAs. With the mmap lock held in read mode, the VMA layout (mostly)
cannot change, so calling walk_page_range() is fine; pagewalk.c properly
takes care of VMAs.

As an example, take a look at MADV_COLD handling in mm/madvise.c.

madvise_need_mmap_write() returns 0, and we end up calling
madvise_cold()->...->walk_page_range() with mmap_read_lock() held.
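
In other words, something along these lines (completely untested) should work
for the mkold case under the read lock, doing the actual work right in the
pmd_entry callback instead of handing locked page table pointers back to the
caller. I'm assuming the damon_ptep_mkold()/damon_pmdp_mkold() helpers your
patchset already has, and the includes your patch already adds:

static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
		unsigned long next, struct mm_walk *walk)
{
	pte_t *pte;
	spinlock_t *ptl;

	/* Handle a huge pmd under its own lock, as in your patch. */
	if (pmd_huge(*pmd)) {
		ptl = pmd_lock(walk->mm, pmd);
		if (pmd_huge(*pmd)) {
			damon_pmdp_mkold(pmd, walk->mm, addr);
			spin_unlock(ptl);
			return 0;
		}
		spin_unlock(ptl);
	}

	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
		return 0;
	/* Ordinary pte: clear the accessed bit under the pte lock. */
	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	if (pte_present(*pte))
		damon_ptep_mkold(pte, walk->mm, addr);
	pte_unmap_unlock(pte, ptl);
	return 0;
}

static const struct mm_walk_ops damon_mkold_ops = {
	.pmd_entry = damon_mkold_pmd_entry,
};

static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
{
	mmap_read_lock(mm);
	walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
	mmap_read_unlock(mm);
}

The access-check side could be restructured the same way, passing the result
back via walk->private instead of returning locked pte/pmd pointers.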

You can exclude any VMAs you don't care about in the test_walk() callback, if required.
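
For example (again untested, and the VM_PFNMAP|VM_IO check is only an
illustration), returning 1 from test_walk skips the VMA, 0 walks it:

static int damon_mkold_test_walk(unsigned long addr, unsigned long next,
		struct mm_walk *walk)
{
	/* Skip special mappings, walk everything else. */
	if (walk->vma->vm_flags & (VM_PFNMAP | VM_IO))
		return 1;
	return 0;
}

and hook it up via .test_walk in the mm_walk_ops above.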

--
Thanks,

David / dhildenb