OOM killer problems

From: Denis Lunev (den@asplinux.ru)
Date: Tue Jun 11 2002 - 06:11:03 EST


Hello, All!

1. 2.4.16 kernel is used on 2xP3-866 SMP machine. There is no
   difference in 2.4.18
2. 2 problems found:
   a) Kernel threads can be killed under some circumstances. It
      looks like 2 OOMs were triggered simultaneously on different
      CPUs, see records from 18:11:06. The process exited and
      became zombie (mm == NULL). So, on the second CPU kernel
      threads were killed.
   b) Another problem is out_of_memory(). It detects OOM while huge
      amount of free pages is present. According to logs (record from
      18:10:38) there are 100k pages present in all zones and OOM
      still happens. From my point of view, some processes were
      scheduled from try_to_free_pages(). After they wake up, they
      still can't shrink anything as there is nothing to shrink due to
      prevous OOM and a lot of free memory present.

The log below was obtained adding some debug printk's into the generic
kernel code, the patch is attached.

Jun 10 18:10:28 ts6 before kill free=1313 spoofers=293 killed=0 eaten=1699035
Jun 10 18:10:28 ts6 Task to kill 3873 (MM=ede83be4, UBover=115267)
Jun 10 18:10:28 ts6 Out of Memory: Killed process 3873 (tstspoof), flags=40 mm=ede83be4.
Jun 10 18:10:38 ts6 before kill free=100571 spoofers=291 killed=0 eaten=1583158
Jun 10 18:10:38 ts6 before kill free=100571 spoofers=291 killed=0 eaten=1583158
Jun 10 18:10:38 ts6 Task to kill 1702 (MM=ec42cbe4, UBover=39)
Jun 10 18:10:38 ts6 Out of Memory: Killed process 1702 (sshd), flags=140 mm=ec42cbe4.
Jun 10 18:10:38 ts6 Task to kill 1716 (MM=ec42c784, UBover=17)
Jun 10 18:10:38 ts6 Out of Memory: Killed process 1716 (update), flags=40 mm=ec42c784.
Jun 10 18:10:48 ts6 before kill free=81495 spoofers=291 killed=0 eaten=1583158
Jun 10 18:10:48 ts6 Task to kill 3890 (MM=f3b03dc4, UBover=0)
Jun 10 18:10:48 ts6 Out of Memory: Killed process 3890 (tstspoof), flags=40 mm=f3b03dc4.
Jun 10 18:10:56 ts6 before kill free=61978 spoofers=290 killed=0 eaten=1577700
Jun 10 18:10:56 ts6 Task to kill 3891 (MM=efc49bc4, UBover=0)
Jun 10 18:10:56 ts6 Out of Memory: Killed process 3891 (tstspoof), flags=40 mm=efc49bc4.
Jun 10 18:11:06 ts6 before kill free=33945 spoofers=289 killed=0 eaten=1572242
Jun 10 18:11:06 ts6 Task to kill 3892 (MM=eba244a4, UBover=0)
Jun 10 18:11:06 ts6 before kill free=32906 spoofers=289 killed=0 eaten=1572242
Jun 10 18:11:06 ts6 Task to kill 3892 (MM=eba244a4, UBover=0)
Jun 10 18:11:06 ts6 Out of Memory: Killed process 3892 (tstspoof), flags=40 mm=eba244a4.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 2 (keventd), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 3 (ksoftirqd_CPU0), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 4 (ksoftirqd_CPU1), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 5 (kswapd), flags=840 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 6 (bdflush), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 7 (kupdated), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 8 (scsi_eh_0), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 9 (scsi_eh_1), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 10 (mdrecoveryd), flags=40 mm=00000000.
Jun 10 18:11:06 ts6 Out of Memory: Killed process 3892 (tstspoof), flags=1c44 mm=00000000.

Regards,
        Denis V. Lunev


--- linux.inodes/mm/oom_kill.c Fri Jun 7 12:53:10 2002
+++ linux/mm/oom_kill.c Mon Jun 10 17:40:42 2002
@@ -256,6 +256,30 @@
          */
         if (++count < 10)
                 return;
+
+ {
+ struct task_struct *ptr;
+ int cc = 0, zeroed = 0;
+ unsigned long vm_size = 0;
+
+ read_lock(&tasklist_lock);
+ for_each_task_all(ptr) {
+ if (strncmp(ptr->comm, "tstspoof", 8))
+ continue;
+ cc++;
+
+ spin_lock(&mmlist_lock);
+ if (ptr->mm)
+ vm_size += ptr->mm->total_vm;
+ else
+ zeroed++;
+ spin_unlock(&mmlist_lock);
+ }
+ read_unlock(&tasklist_lock);
+
+ printk("before kill free=%u spoofers=%d killed=%d eaten=%lu\n",
+ nr_free_pages(), cc, zeroed, vm_size);
+ }
 
         /*
          * Ok, really out of memory. Kill something.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Jun 15 2002 - 22:00:22 EST