Re: pre-2.0.32-2 + this patch works for me

Bernd Schmidt (crux@Pool.Informatik.RWTH-Aachen.DE)
Wed, 18 Jun 1997 14:07:01 +0200 (MET DST)


I've seen and tried several of the buffer cache fixes that have been posted
on this list. I've come up with a version of my own that behaves a little
better in a particular situation.

I use the following program to put some stress on the buffer cache:
----------- mmap-write-test.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (8*1024*1024)

int main(int argc, char **argv)
{
    int fd;
    char *p;
    int i;

    if (argc != 3)
        exit(1);

    /* Create a file one byte larger than SIZE so that the whole
       mapping below is backed by the file. */
    fd = open("testfile", O_RDWR|O_CREAT|O_TRUNC, 0777);
    lseek(fd, SIZE, SEEK_SET);
    write(fd, &i, 1);

    /* After the fork, parent and child both map the same file
       MAP_SHARED and dirty it at the same time. */
    fork();
    p = mmap(0, SIZE, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
    printf("%d %p\n", fd, p);
    for (i = 0; i < atoi(argv[2]); i++) {
        memset(p, atoi(argv[1]), SIZE);
    }

    return 0;
}
-----------
Start it with "mmap-write-test 2 2". It is written for a 486 with 8MB RAM
and 40MB swap - it won't show the problem if you have more memory, so say
"mem=8m" when booting the kernel.
With current kernels, with or without some of the patches I've seen, it can
do anything from aborting with SIGBUS, to printing "Out of memory for init"
messages, to locking up the kernel in an endless loop.

I've attached a patch at the end of this mail that partially solves the
problem for me. It consists of bits and pieces taken from other people's
patches and some ideas of my own. First, let me explain what it does:

The main problem I see in the code is that swapping may need to allocate
buffers and buffer heads, which may in turn need additional memory. This is
the heart of the problem that needs to be fixed; everything else is just a
workaround.

Swapping needs buffer heads because page IO is done with faked buffer heads.
It may also need buffers to determine bmap information if a shared dirty
page is to be written back to the file on disk.

There are two possible recursions that must be taken care of:

1. Allocating buffers and/or buffer heads may need to get additional pages.
   If IO is attempted to free memory, the code may end up trying to allocate
   buffers again -> recursion (see the toy sketch after this list).
2. Swapping may need buffers. To get the memory for the new buffers, another
   swap-out attempt could be made -> recursion.
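
To make recursion 1 concrete, here is a tiny userspace toy model. All of the
toy_* names are made up for illustration; the real path runs through
refill_freelist(), try_to_free_page() and try_to_swap_out(). The sketch only
shows how the allocation and freeing paths can chase each other.
----------- toy-recursion.c (illustration only, not part of the patch)
#include <stdio.h>

static int free_pages = 0;      /* pretend memory is completely exhausted */
static int depth = 0;

static int toy_alloc_buffer(void);

/* "Swapping out" a dirty page needs a buffer head to describe the IO. */
static int toy_swap_out_page(void)
{
    return toy_alloc_buffer();
}

/* Freeing a page means swapping something out. */
static int toy_try_to_free_page(void)
{
    return toy_swap_out_page();
}

/* Allocating a buffer needs a free page; if there is none, try to free one. */
static int toy_alloc_buffer(void)
{
    if (++depth > 5) {          /* cut the endless recursion short */
        printf("recursed %d levels deep - this is the problem\n", depth);
        return -1;
    }
    if (free_pages > 0) {
        free_pages--;
        return 0;
    }
    /* No memory left: freeing a page may swap, and swapping may try to
       allocate a buffer again -> recursion 1. */
    return toy_try_to_free_page();
}

int main(void)
{
    toy_alloc_buffer();
    return 0;
}
-----------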

It is absolutely vital that a process can allocate a new buffer or buffer
head immediately when it is doing a swap-out. In other cases, e.g. for a file
read, it is possible to try to free some memory to make room for the new
buffer. Note: we could try to do that even while swapping, but the code
_must_ work even if it can't free a single page! There must be a small amount
of reserved memory or reserved buffers that can only be used while swapping
is in progress.
For this reserved area, my patch uses the pool of min_free_pages pages which
the memory manager will never touch for normal allocations.
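
As a rough illustration of the reserve idea, here is another toy userspace
model. It is not the real __get_free_pages() code in mm/page_alloc.c; the
constants and the "atomic" flag are just used in the same spirit as
min_free_pages and GFP_ATOMIC in the patch.
----------- toy-reserve.c (illustration only, not part of the patch)
#include <stdio.h>

#define TOTAL_PAGES    2048
#define MIN_FREE_PAGES 24       /* plays the role of min_free_pages */

static int nr_free_pages = TOTAL_PAGES;

/* atomic != 0 corresponds to a GFP_ATOMIC allocation in the patch. */
static int toy_get_free_page(int atomic)
{
    int reserve = atomic ? 0 : MIN_FREE_PAGES;

    if (nr_free_pages > reserve) {
        nr_free_pages--;
        return 1;               /* success */
    }
    return 0;                   /* ordinary callers must go and free memory */
}

int main(void)
{
    /* Drain memory with ordinary allocations ... */
    while (toy_get_free_page(0))
        ;
    printf("normal allocations stop with %d pages left\n", nr_free_pages);

    /* ... but the swap path can still make progress from the reserve. */
    printf("atomic allocation from the reserve: %s\n",
           toy_get_free_page(1) ? "ok" : "failed");
    return 0;
}
-----------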

It is impractical to keep track of whether a given buffer allocation is for
swapping or not. For this reason, I've added a new flag "in_swap_code" to
struct task_struct. It is set whenever a process enters try_to_free_page(),
and cleared when it leaves. The behaviour of some buffer cache functions is
changed depending on whether this flag is set (a simplified model of the
whole scheme is sketched after the list):
- refill_freelist() uses GFP_BUFFER if not swapping. This enables
  get_free_page() to call try_to_free_page(). No IO will be done for
  GFP_BUFFER allocations (as in David's original patch), and
  try_to_free_buffer() will not be called from shrink_mmap().
  refill_freelist() will try to allocate the full amount of buffers
  specified by the nrefill value.
  Using GFP_BUFFER instead of (for example) GFP_KERNEL prevents recursion 1.
- refill_freelist() uses GFP_ATOMIC if swapping. This will use pages below
  the min_free_pages threshold if necessary. These pages are never touched
  by normal allocations; they are entirely reserved for GFP_ATOMIC. This
  way, even with a decent (read: low) value for min_free_pages, you can
  more-or-less guarantee that such an allocation will always succeed.
  Also, if swapping, refill_freelist() returns immediately once it has
  allocated some more buffers. This ensures that the allocator does not dip
  below the min_free_pages limit more than necessary.
  Using GFP_ATOMIC also prevents further swap-out attempts (recursion 2).
This patch cures most of the problems for me, but it's not perfect. Every once
in a while, the above program still gets a SIGBUS, and I don't know why. But
the problem is reduced a great deal. (The buffer cache code is sufficiently
messy that it's quite possible that fixing this problem introduces new ones.)
I'd appreciate feedback from anyone who has problems with 2.0.30/pre-2.0.31.

The enclosed patch is against plain 2.0.30.

Bernd

--- ./fs/buffer.c.orig-1 Tue Jun 17 15:03:50 1997
+++ ./fs/buffer.c Tue Jun 17 15:12:20 1997
@@ -597,6 +597,10 @@
int buffers[BUF_DIRTY];
int i;
int needed;
+ int pri = GFP_BUFFER;
+
+ if (current->in_swap_code)
+ pri = GFP_ATOMIC;

refilled = 1;
/* If there are too many dirty buffers, we wake up the update process
@@ -607,8 +611,10 @@
needed =bdf_prm.b_un.nrefill * size;

while (nr_free_pages > min_free_pages*2 && needed > 0 &&
- grow_buffers(GFP_BUFFER, size)) {
+ grow_buffers(pri, size)) {
needed -= PAGE_SIZE;
+ if (current->in_swap_code)
+ return;
}

repeat:
@@ -654,23 +660,21 @@
}
if (needed >= 0)
goto repeat;
+ if (current->in_swap_code)
+ return;
}

if(needed <= 0) return;

/* Too bad, that was not enough. Try a little harder to grow some. */
-
- if (nr_free_pages > min_free_pages + 5) {
- if (grow_buffers(GFP_BUFFER, size)) {
- needed -= PAGE_SIZE;
- goto repeat;
- };
- }
-
- /* and repeat until we find something good */
- if (!grow_buffers(GFP_ATOMIC, size))
+ if (!grow_buffers(pri, size))
wakeup_bdflush(1);
- needed -= PAGE_SIZE;
+ else {
+ needed -= PAGE_SIZE;
+ if (current->in_swap_code)
+ return;
+ }
+
goto repeat;
}

@@ -922,6 +926,10 @@
static void get_more_buffer_heads(void)
{
struct buffer_head * bh;
+ int pri = GFP_BUFFER;
+
+ if (current->in_swap_code)
+ pri = GFP_ATOMIC;

while (!unused_list) {
/*
@@ -932,7 +940,7 @@
/* we now use kmalloc() here instead of gfp as we want
to be able to easily release buffer heads - they
took up quite a bit of memory (tridge) */
- bh = (struct buffer_head *) kmalloc(sizeof(*bh),GFP_ATOMIC);
+ bh = (struct buffer_head *) kmalloc(sizeof(*bh), pri);
if (bh) {
put_unused_buffer_head(bh);
nr_buffer_heads++;
--- ./mm/vmscan.c.orig-1 Tue Jun 17 15:03:58 1997
+++ ./mm/vmscan.c Tue Jun 17 16:15:34 1997
@@ -68,7 +68,7 @@
* have died while we slept).
*/
static inline int try_to_swap_out(struct task_struct * tsk, struct vm_area_struct* vma,
- unsigned long address, pte_t * page_table, int dma, int wait)
+ unsigned long address, pte_t * page_table, int dma, int wait, int can_do_io)
{
pte_t pte;
unsigned long entry;
@@ -100,6 +100,8 @@
if (page_map->age)
return 0;
if (pte_dirty(pte)) {
+ if (!can_do_io)
+ return 0;
if (vma->vm_ops && vma->vm_ops->swapout) {
pid_t pid = tsk->pid;
vma->vm_mm->rss--;
@@ -157,7 +159,8 @@
*/

static inline int swap_out_pmd(struct task_struct * tsk, struct vm_area_struct * vma,
- pmd_t *dir, unsigned long address, unsigned long end, int dma, int wait)
+ pmd_t *dir, unsigned long address, unsigned long end, int dma, int wait,
+ int can_do_io)
{
pte_t * pte;
unsigned long pmd_end;
@@ -179,7 +182,7 @@
do {
int result;
tsk->swap_address = address + PAGE_SIZE;
- result = try_to_swap_out(tsk, vma, address, pte, dma, wait);
+ result = try_to_swap_out(tsk, vma, address, pte, dma, wait, can_do_io);
if (result)
return result;
address += PAGE_SIZE;
@@ -189,7 +192,8 @@
}

static inline int swap_out_pgd(struct task_struct * tsk, struct vm_area_struct * vma,
- pgd_t *dir, unsigned long address, unsigned long end, int dma, int wait)
+ pgd_t *dir, unsigned long address, unsigned long end, int dma, int wait,
+ int can_do_io)
{
pmd_t * pmd;
unsigned long pgd_end;
@@ -209,7 +213,7 @@
end = pgd_end;

do {
- int result = swap_out_pmd(tsk, vma, pmd, address, end, dma, wait);
+ int result = swap_out_pmd(tsk, vma, pmd, address, end, dma, wait, can_do_io);
if (result)
return result;
address = (address + PMD_SIZE) & PMD_MASK;
@@ -219,7 +223,7 @@
}

static int swap_out_vma(struct task_struct * tsk, struct vm_area_struct * vma,
- pgd_t *pgdir, unsigned long start, int dma, int wait)
+ pgd_t *pgdir, unsigned long start, int dma, int wait, int can_do_io)
{
unsigned long end;

@@ -230,7 +234,7 @@

end = vma->vm_end;
while (start < end) {
- int result = swap_out_pgd(tsk, vma, pgdir, start, end, dma, wait);
+ int result = swap_out_pgd(tsk, vma, pgdir, start, end, dma, wait, can_do_io);
if (result)
return result;
start = (start + PGDIR_SIZE) & PGDIR_MASK;
@@ -239,7 +243,7 @@
return 0;
}

-static int swap_out_process(struct task_struct * p, int dma, int wait)
+static int swap_out_process(struct task_struct * p, int dma, int wait, int can_do_io)
{
unsigned long address;
struct vm_area_struct* vma;
@@ -260,7 +264,7 @@
address = vma->vm_start;

for (;;) {
- int result = swap_out_vma(p, vma, pgd_offset(p->mm, address), address, dma, wait);
+ int result = swap_out_vma(p, vma, pgd_offset(p->mm, address), address, dma, wait, can_do_io);
if (result)
return result;
vma = vma->vm_next;
@@ -272,7 +276,7 @@
return 0;
}

-static int swap_out(unsigned int priority, int dma, int wait)
+static int swap_out(unsigned int priority, int dma, int wait, int gfp_level)
{
static int swap_task;
int loop, counter;
@@ -311,7 +315,7 @@
}
if (!--p->swap_cnt)
swap_task++;
- switch (swap_out_process(p, dma, wait)) {
+ switch (swap_out_process(p, dma, wait, gfp_level != GFP_BUFFER)) {
case 0:
if (p->swap_cnt)
swap_task++;
@@ -336,27 +340,37 @@
int i=6;
int stop;

+ if (current->in_swap_code)
+ printk("Should not happen\n");
+ current->in_swap_code = 1;
/* we don't try as hard if we're not waiting.. */
stop = 3;
if (wait)
stop = 0;
switch (state) {
do {
- case 0:
- if (shrink_mmap(i, dma))
- return 1;
+ case 0:
state = 1;
- case 1:
- if (shm_swap(i, dma))
+ if (shrink_mmap(i, priority, dma)) {
+ current->in_swap_code = 0;
return 1;
+ }
+ case 1:
state = 2;
- default:
- if (swap_out(i, dma, wait))
+ if (priority != GFP_BUFFER && shm_swap(i, dma)) {
+ current->in_swap_code = 0;
return 1;
+ }
+ default:
state = 0;
+ if (swap_out(i, dma, wait, priority)) {
+ current->in_swap_code = 0;
+ return 1;
+ }
i--;
} while ((i - stop) >= 0);
}
+ current->in_swap_code = 0;
return 0;
}

--- ./mm/filemap.c.orig-1 Tue Jun 17 15:04:05 1997
+++ ./mm/filemap.c Tue Jun 17 16:02:23 1997
@@ -125,7 +125,7 @@
}
}

-int shrink_mmap(int priority, int dma)
+int shrink_mmap(int priority, int gfp_level, int dma)
{
static int clock = 0;
struct page * page;
@@ -181,7 +181,7 @@
}

/* is it a buffer cache page? */
- if (bh && try_to_free_buffer(bh, &bh, 6))
+ if (bh && gfp_level != GFP_BUFFER && try_to_free_buffer(bh, &bh, 6))
return 1;
break;

--- ./mm/page_alloc.c.orig-1 Tue Jun 17 16:01:29 1997
+++ ./mm/page_alloc.c Tue Jun 17 16:46:13 1997
@@ -214,7 +214,7 @@
return 0;
}
restore_flags(flags);
- if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
+ if (try_to_free_page(priority, dma, 1))
goto repeat;
return 0;
}
--- ./include/linux/sched.h.orig-1 Tue Jun 17 15:05:09 1997
+++ ./include/linux/sched.h Tue Jun 17 15:14:55 1997
@@ -218,6 +218,7 @@
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
+ int in_swap_code:1;
unsigned long swap_address;
unsigned long old_maj_flt; /* old value of maj_flt */
unsigned long dec_flt; /* page fault count of the last time */
@@ -297,7 +298,7 @@
/* timer */ { NULL, NULL, 0, 0, it_real_fn }, \
/* utime */ 0,0,0,0,0, \
/* flt */ 0,0,0,0,0,0, \
-/* swp */ 0,0,0,0,0, \
+/* swp */ 0,0,0,0,0,0, \
/* rlimits */ INIT_RLIMITS, \
/* math */ 0, \
/* comm */ "swapper", \
--- ./include/linux/mm.h.orig-1 Tue Jun 17 15:04:44 1997
+++ ./include/linux/mm.h Tue Jun 17 16:00:47 1997
@@ -295,7 +295,7 @@

/* filemap.c */
extern unsigned long page_unuse(unsigned long);
-extern int shrink_mmap(int, int);
+extern int shrink_mmap(int, int, int);
extern void truncate_inode_pages(struct inode *, unsigned long);

#define GFP_BUFFER 0x00