patch cow-swapin [was Re: Very bad swap bug -- 2.0, 2.1 at least]

Andrea Arcangeli (andrea@e-mind.com)
Wed, 23 Sep 1998 03:53:43 +0200 (CEST)


On Tue, 15 Sep 1998, Simon Kirby wrote:

>This swap bug that I mentioned a while back is still happening, and this
>time seems to be much worse than before. In this particular case it is
>happening on a medium-loaded web server running 2.0.35. It's causing the
>load to go up to 5 or so due to the disk I/O being so saturated that many
>user processes get blocked.
>
>"vmstat 1" output:
>
>  procs                  memory   swap     io  system      cpu
>  r b w  swpd  free  buff cache  si so  bi bo  in  cs us sy id
>  1 0 0  5376 18236 10200 52244 380  0 174  0 292 305  2  9 89
>  1 3 0  5376 18148 10200 52380 444  0 292  0 365 387 17 17 65
>  1 1 0  5376 17508 10200 52668 404  0 410  0 374 369  9 15 77
>  2 0 0  5376 17464 10200 52644 112  0  28  0 227 149  9 12 80
>  4 1 0  5376 16912 10200 52796  24  0 166  0 242 163 10  8 83
>  0 2 0  5376 16620 10200 52944 480  0 276  0 353 365  7 13 80
>  1 0 0  5376 16684 10200 53008 436  0 205  0 336 377 26 20 53
>  0 1 0  5376 17232 10200 53084 348  0 164  0 270 265  8 12 81
>  2 0 0  5376 16932 10200 53220 124  0 183  0 220 173  6 15 80
>  5 3 0  5376 16576 10200 53300 420  0 186  0 285 297 14  8 78
>  2 2 0  5376 16516 10200 53300 500  0 127  0 324 355  7 10 83
>  2 1 0  5376 15928 10200 53320 520  0 147  0 295 355  7 13 80
>  1 0 0  5376 16308 10200 53348 180  0  89  0 256 282 34 13 53
>  1 1 0  5376 15852 10200 53524 240  0 245  0 260 242 10 13 78
>  0 0 0  5376 16028 10200 53596 264  0 125  0 275 471 13  9 79

It's a bit late, but I'm happy because I think I have finally fixed
the problem pretty well (and correctly) here.

It's a bit difficult to explain the cause of the problem in plain
_English_ words. In short, you have a process that touched some of its
.data area and then never used it again, so that data got swapped out.
Then the .text of that process (still in RAM) runs a fork(), and the child
process starts touching the old swapped-out data, which therefore has to
be swapped in. But the parent process doesn't know that the data has been
swapped in, so if it forks again, the new child will swap it in again.

This is a simple proggy I developed that triggers the problem:

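/*
 * COW_swapin.c Copyright (C) 1998 Andrea Arcangeli
 */

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define BUFSIZE 2000000

volatile int buf[BUFSIZE];

int main(void)
{
    volatile int i;

    /* Dirty the whole buffer once so that it can be swapped out later. */
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = i;
    for (;;)
    {
        sleep(10);
        if (!fork())
        {
            /* Child: touching the buffer again forces the swapin. */
            printf("now\n");
            for (i = 0; i < BUFSIZE; i++)
                buf[i] = i;
            break;
        }
        else
            wait(NULL);
    }
    return 0;
}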

To see the swapin every 10 seconds, you only need to run the proggy in the
background and then run some instances of this second proggy to force the
swapout of the dirty buffer (you can of course also run a huge application
instead of the following proggy). Then you can kill the following proggy,
leaving only the first process in the background. It will swap in _tons_
of data every 10 seconds.

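#include <stdlib.h>

int main(void)
{
    char *p[20];
    int i, j;

    for (j = 0; j < 20; j++)
    {
        p[j] = (char *) malloc(1000000);
        if (!p[j])
            return 1;
    }
    /* Keep dirtying 20 MB forever to push other pages out to swap. */
    for (;;)
        for (j = 0; j < 20; j++)
        {
            for (i = 0; i < 1000000; i++)
                p[j][i] = 0;
        }
}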

I developed a fix that seems to work well _here_ (non-SMP x86).

The idea implemented in the patch is pretty simple. Once the kernel has
swapped in a swap entry for a process that has not yet run an exec, I
check whether the parent still shares the same swap entry as the child
that has just been swapped in. If so, I update the pte of the parent with
the new page. If the page fault is a write fault, I do the COW work right
away, so the parent gets its own copy of the page.

I'll do a 2.0.x backport soon.

It would be nice if somebody would also try it on SMP and on non-x86
architectures.

The patch is against 2.1.122.

--- devel/kernel-tree/linux-2.1.122/include/linux/swap.h Sat Sep 5 14:17:56 1998
+++ linux/include/linux/swap.h Tue Sep 22 23:26:25 1998
@@ -81,6 +81,9 @@
extern void swap_in(struct task_struct *, struct vm_area_struct *,
pte_t *, unsigned long, int);

+/* linux/mm/swapin_parent.c */
+FASTCALL(extern void swapin_parent(struct task_struct *, unsigned long,
+ pte_t *, unsigned long, unsigned int));

/* linux/mm/swap_state.c */
extern void show_swap_cache_info(void);
--- /dev/null Tue May 6 02:10:56 1997
+++ linux/mm/swapin_parent.c Wed Sep 23 03:34:48 1998
@@ -0,0 +1,201 @@
+/*
+ * swapin_parent: join mem between swapped in childs and a swapped out parents
+ * Copyright (C) 1998 Andrea Arcangeli
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * You can reach Andrea Arcangeli at <andrea@e-mind.com>.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/swap.h>
+
+#include <asm/pgtable.h>
+
+static __inline__ unsigned long duplicate(unsigned long old_page)
+{
+ /*
+ * The tasklist_lock is read_locked, so we really can' t sleep here.
+ * And I don' t like to avoid swapins swapping out some other thing
+ * more recently used.
+ */
+ unsigned long new_page = __get_free_page(GFP_ATOMIC);
+ if (new_page)
+ copy_page(new_page, old_page);
+ return new_page;
+}
+
+#define pte_mkcow(new_page, vma) \
+ pte_mkwrite(pte_mkdirty(mk_pte((new_page), (vma)->vm_page_prot)))
+
+FASTCALL(static void do_swapin_parent(pte_t *new_pte, pte_t *pte,
+ unsigned long entry,
+ struct vm_area_struct *vma,
+ unsigned int write,
+ struct task_struct *parent));
+
+static void do_swapin_parent(pte_t *new_pte, pte_t *pte,
+ unsigned long entry, struct vm_area_struct *vma,
+ unsigned int write, struct task_struct *parent)
+{
+ struct page *page;
+ unsigned long map_nr;
+
+ map_nr = MAP_NR(pte_page(*new_pte));
+ page = &mem_map[map_nr];
+
+ if (map_nr >= max_mapnr)
+ {
+ printk(KERN_ERR "do_swapin_parent: map_nr >= max_mapnr!\n");
+ return;
+ }
+ if (PageReserved(mem_map+map_nr))
+ {
+ printk(KERN_ERR "do_swapin_parent: "
+ "swapped in page was reserved!\n");
+ return;
+ }
+
+ if (write)
+ {
+ unsigned long new_page;
+ if (!pte_write(*new_pte))
+ {
+ printk(KERN_WARNING "do_swapin_parent: swapin after "
+ "writefault marked the page not writable\n");
+ return;
+ }
+ new_page = duplicate(pte_page(*new_pte));
+ if (!new_page)
+ return;
+ set_pte(pte, pte_mkcow(new_page, vma));
+ } else {
+ if (pte_write(*new_pte))
+ {
+ printk(KERN_WARNING "do_swapin_parent: swapin after "
+ "readfault marked the page writable\n");
+ return;
+ }
+ set_pte(pte, *new_pte);
+ atomic_inc(&page->count);
+ }
+
+ ++vma->vm_mm->rss;
+ ++parent->maj_flt;
+ swap_free(entry);
+}
+
+
+FASTCALL(static void try_to_swapin_parent(struct task_struct *parent,
+ unsigned long address,
+ pte_t *new_pte, unsigned long entry,
+ unsigned int write));
+
+static void try_to_swapin_parent(struct task_struct *parent,
+ unsigned long address,
+ pte_t *new_pte, unsigned long entry,
+ unsigned int write)
+{
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ struct vm_area_struct *vma;
+
+ vma = find_vma(parent->mm, address);
+ if (!vma)
+ {
+ printk(KERN_ERR "try_to_swapin_parent: NULL vma!\n");
+ return;
+ }
+
+ pgd = pgd_offset(vma->vm_mm, address);
+ if (pgd_none(*pgd))
+ return;
+ if (pgd_bad(*pgd)) {
+ printk(KERN_ERR "try_to_swapin_parent: bad pgd (%08lx)\n",
+ pgd_val(*pgd));
+ pgd_clear(pgd);
+ return;
+ }
+
+ pmd = pmd_offset(pgd, address);
+ if (pmd_none(*pmd))
+ return;
+ if (pmd_bad(*pmd))
+ {
+ printk(KERN_ERR "try_to_swapin_parent: bad pmd (%08lx)\n",
+ pmd_val(*pmd));
+ pmd_clear(pmd);
+ return;
+ }
+
+ pte = pte_offset(pmd, address);
+
+ if (pte_val(*pte) != entry)
+ return;
+
+ do_swapin_parent(new_pte, pte, entry, vma, write, parent);
+}
+
+void swapin_parent(struct task_struct *child, unsigned long address,
+ pte_t *new_pte, unsigned long entry, unsigned int write)
+{
+ struct task_struct *parent;
+
+ if (child->did_exec)
+ return;
+
+ /*
+ * A bit of PARANOID.
+ */
+ if (pte_val(*new_pte) == entry)
+ {
+ printk(KERN_WARNING "swapin_parent: child not yet swapped "
+ "in\n");
+ return;
+ }
+ if (pte_val(*new_pte) == pte_val(BAD_PAGE))
+ {
+ printk(KERN_WARNING "swapin_parent: swapped in page is BAD\n");
+ return;
+ }
+ if (pte_none(*new_pte))
+ {
+ printk(KERN_ERR "swapin_parent: child page table NULL!\n");
+ return;
+ }
+ if (!pte_present(*new_pte))
+ {
+ printk(KERN_ERR "swapin_parent: child wrong swap entry!\n");
+ return;
+ }
+
+ read_lock(&tasklist_lock);
+ parent = child->p_pptr;
+ if (!parent)
+ {
+ printk(KERN_ERR "swapin_parent: parent NULL!\n");
+ goto out_unlock;
+ }
+#ifdef __SMP__
+ if (parent->has_cpu)
+ goto out_unlock;
+#endif
+ try_to_swapin_parent(parent, address, new_pte, entry, write);
+ out_unlock:
+ read_unlock(&tasklist_lock);
+}
--- devel/kernel-tree/linux-2.1.122/mm/memory.c Thu Sep 17 18:43:12 1998
+++ linux/mm/memory.c Tue Sep 22 23:16:51 1998
@@ -641,7 +641,6 @@
struct page * page_map;

pte = *page_table;
- new_page = __get_free_page(GFP_KERNEL);
/* Did someone else copy this page for us while we slept? */
if (pte_val(*page_table) != pte_val(pte))
goto end_wp_page;
@@ -659,6 +658,7 @@
* Do we need to copy?
*/
if (is_page_shared(page_map)) {
+ new_page = __get_free_page(GFP_KERNEL);
if (new_page) {
if (PageReserved(mem_map + MAP_NR(old_page)))
++vma->vm_mm->rss;
@@ -683,15 +683,11 @@
flush_cache_page(vma, address);
set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
flush_tlb_page(vma, address);
- if (new_page)
- free_page(new_page);
return;
bad_wp_page:
printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page);
send_sig(SIGKILL, tsk, 1);
end_wp_page:
- if (new_page)
- free_page(new_page);
return;
}

@@ -789,6 +785,8 @@

if (!vma->vm_ops || !vma->vm_ops->swapin) {
swap_in(tsk, vma, page_table, pte_val(entry), write_access);
+ swapin_parent(tsk, address, page_table, pte_val(entry),
+ write_access);
flush_page_to_ram(pte_page(*page_table));
return;
}
--- devel/kernel-tree/linux-2.1.122/mm/Makefile Mon Jun 1 23:57:07 1998
+++ linux/mm/Makefile Tue Sep 22 15:56:09 1998
@@ -9,7 +9,7 @@

O_TARGET := mm.o
O_OBJS := memory.o mmap.o filemap.o mprotect.o mlock.o mremap.o \
- vmalloc.o slab.o \
+ vmalloc.o slab.o swapin_parent.o \
swap.o vmscan.o page_io.o page_alloc.o swap_state.o swapfile.o

include $(TOPDIR)/Rules.make

Andrea[s] Arcangeli
