[PATCH v2] mm: make every pte dirty on do_swap_page

From: Minchan Kim
Date: Mon Mar 30 2015 - 00:43:08 EST


Basically, MADV_FREE relies on the dirty bit in the page table entry
to decide whether the VM is allowed to discard the page.
IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, when a page is swapped in by a read fault, the page table
entry pointing to the page does not have the dirty bit set, so
MADV_FREE might discard the page wrongly. To avoid the problem,
MADV_FREE did additional checks with PageDirty and PageSwapCache.
That worked because a swapped-in page lives in the swap cache,
and once it is evicted from the swap cache, the page has the
PG_dirty flag. So the two page flag checks effectively prevent
wrong discarding by MADV_FREE.

A problem with the above logic is that swapped-in pages have PG_dirty
once they are removed from the swap cache, so the VM can no longer
consider those pages freeable even if madvise_free is
called later. See the example below for details.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of the pages are swapped out
..
..
var = *ptr; -> a page is swapped in and removed from the swap cache.
               The page table entry does not have the dirty bit set,
               but the page descriptor has PG_dirty.
..
..
madvise_free(ptr);
..
..
..
.. heavy memory pressure again.
.. This time, the VM cannot discard the page because the page
.. has *PG_dirty*
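
As a concrete user-space version of the timeline above, here is a
minimal sketch in C. It assumes a kernel and libc headers that define
MADV_FREE as described in madvise(2); mark_range_freeable() is a
hypothetical helper name, not something from this patch. Swap-out
cannot be forced from a small program, so the memory pressure steps
appear only as comments.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper. Returns 1 if MADV_FREE was accepted, 0 if this
 * kernel or libc does not support it, -1 on any other failure.
 */
static int mark_range_freeable(size_t len)
{
	unsigned char *ptr;
	volatile unsigned char var;
	int ret = 0;

	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;

	memset(ptr, 0xaa, len);	/* dirty every page */

	/* ... heavy memory pressure could swap the pages out here ... */

	var = ptr[0];		/* read access: a read-fault swap-in if swapped out */
	(void)var;

#ifdef MADV_FREE
	if (madvise(ptr, len, MADV_FREE) == 0) {
		ret = 1;
		/* ... under memory pressure the kernel may now drop the pages ... */

		/* A store after MADV_FREE cancels the free for that page. */
		ptr[0] = 0x55;
		assert(ptr[0] == 0x55);
	} else if (errno != EINVAL && errno != ENOSYS) {
		ret = -1;
	}
#endif

	munmap(ptr, len);
	return ret;
}
```

Whether the pages are really discarded depends on memory pressure, so
the sketch only checks that the call is accepted and that data written
after MADV_FREE is preserved.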

Rather than relying on PG_dirty in the page descriptor
to prevent discarding a page, using the dirty bit in the page table
is more straightforward and simple. So this patch marks the page table
dirty bit whenever a swap-in happens. Inherently, the page table entry
that pointed to the swapped-out page had the dirty bit set, so I think
it's no problem.
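
To illustrate the swap-in change, the following is a small user-space
model, not kernel code. The MODEL_PTE_* bits and the
swap_in_pte_old()/swap_in_pte_new() helpers are hypothetical names;
can_reuse stands in for the result of reuse_swap_page() and
vma_writable for the VM_WRITE check done by maybe_mkwrite().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PTE_DIRTY 0x1u	/* hypothetical bit values for the model */
#define MODEL_PTE_WRITE 0x2u

/* Before the patch: dirty only on a write fault that can reuse the page. */
static unsigned int swap_in_pte_old(bool write_fault, bool can_reuse,
				    bool vma_writable)
{
	unsigned int pte = 0;

	if (write_fault && can_reuse) {
		pte |= MODEL_PTE_DIRTY;
		if (vma_writable)
			pte |= MODEL_PTE_WRITE;
	}
	return pte;
}

/* After the patch: every swapped-in pte is dirty, read fault or not. */
static unsigned int swap_in_pte_new(bool write_fault, bool can_reuse,
				    bool vma_writable)
{
	unsigned int pte = MODEL_PTE_DIRTY;

	if (write_fault && can_reuse && vma_writable)
		pte |= MODEL_PTE_WRITE;

	assert(pte & MODEL_PTE_DIRTY);	/* invariant the patch relies on */
	return pte;
}
```

In this model the two versions differ only for faults that do not take
the write-and-reuse path, which is exactly the read-fault case the
description is concerned with.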

With this, the patch removes complicated logic and makes the freeable
page check for madvise_free simple. Of course, it also solves the
example mentioned above.
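
The difference between the old and new freeable checks can be sketched
with a tiny user-space model. This is an illustration only: struct
model_page and all helper names are hypothetical, and only the path
from the example (a page that has already left the swap cache) is
modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the reclaim path looks at. */
struct model_page {
	bool pte_dirty;		/* dirty bit in the page table entry */
	bool pg_dirty;		/* PG_dirty in the page descriptor */
	bool in_swapcache;	/* PageSwapCache() */
};

/* Old check: PG_dirty also blocks discarding. */
static bool freeable_old(const struct model_page *p)
{
	return !p->pte_dirty && !p->in_swapcache && !p->pg_dirty;
}

/* New check: only the pte dirty bit and the swap cache state matter. */
static bool freeable_new(const struct model_page *p)
{
	return !p->pte_dirty && !p->in_swapcache;
}

/* State after a read-fault swap-in once the page left the swap cache. */
static struct model_page swapped_in_by_read(bool pte_dirty_on_swap_in)
{
	struct model_page p = {
		.pte_dirty = pte_dirty_on_swap_in,
		.pg_dirty = true,
		.in_swapcache = false,
	};
	return p;
}

/* madvise_free on such a page: only the pte dirty bit is cleared. */
static void model_madvise_free(struct model_page *p)
{
	assert(!p->in_swapcache);	/* swap cache path is not modeled */
	p->pte_dirty = false;
}
```

With pte_dirty_on_swap_in set to false (old swap-in), freeable_old()
stays false even after model_madvise_free(), which is the problem in
the example. With it set to true (this patch), freeable_new() is false
until model_madvise_free() runs and true afterwards. The model also
shows why both changes go together: the new check combined with the
old swap-in would report a page as freeable before madvise_free was
ever called.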

Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxx>
Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
Reported-by: Yalin Wang <yalin.wang@xxxxxxxxxxxxxx>
Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---

* From v1:
* Rewrite description - Andrew

mm/madvise.c | 1 -
mm/memory.c | 10 ++++++++--
mm/rmap.c | 2 +-
mm/vmscan.c | 3 +--
4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..a045798 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
continue;
}

- ClearPageDirty(page);
unlock_page(page);
}

diff --git a/mm/memory.c b/mm/memory.c
index 6743966..48ff537 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2521,9 +2521,15 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,

inc_mm_counter_fast(mm, MM_ANONPAGES);
dec_mm_counter_fast(mm, MM_SWAPENTS);
- pte = mk_pte(page, vma->vm_page_prot);
+
+ /*
+ * The page being swapped in was dirty before it was swapped out,
+ * so restore that state (i.e., pte_mkdirty) because MADV_FREE
+ * relies on the dirty bit in the page table.
+ */
+ pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+ pte = maybe_mkwrite(pte, vma);
flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a4..281e806 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1275,7 +1275,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

if (flags & TTU_FREE) {
VM_BUG_ON_PAGE(PageSwapCache(page), page);
- if (!dirty && !PageDirty(page)) {
+ if (!dirty) {
/* It's a freeable page by MADV_FREE */
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc6cd51..fffebf0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_KEEP;
}

- if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
- !PageDirty(page))
+ if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
*freeable = true;

/* Reclaim if clean, defer dirty pages to writeback */
--
1.9.3

--
Kind regards,
Minchan Kim