OOM

From: Claus Fischer (claus.fischer@intel.com)
Date: Mon Jul 17 2000 - 13:07:07 EST


Here's a short summary for discussion.

* We have little OOM expertise

  Most people simply add `enough swap' to make the problem
  go away. As a result, expertise, statistics, models, etc.
  do not accumulate, and doing so isn't usually desirable.

* A kernel handler is needed

  - to improve on the current `random killing'
  - because it is the kernel's responsibility to make the bare
    system survive, especially on remotely operated or
    unattended machines
  - to provide a reliable path of events through and beyond the
    OOM situation (most importantly, notify users & admins)
  - as a fallback if a userspace handler fails
  - because many systems will not have proper userspace
    handlers set up
  - without precluding a userspace solution

* A userspace handler is desirable

  - it can only be a supplement to the kernel handler
  - preparing for OOM is part of responsible system administration
  - policy can be adapted to specific needs
  - a good kernel space default will often be good enough

* Policy in kernel

  - every selection of a process to kill is a `policy';
    the current one is bad
  - in-kernel `policy' must be simple and general [1];
    better policies should go through userland

* Notification

  - of admins through syslog
  - of process owners through a userspace daemon

* Killing threshold

  - detection of OOM is the really tricky part
  - a `low on VM watermark' is often a good thing:
    when swap gets full, the system can become very slow.
    You want to get past the OOM situation quickly and move on.
    Instead of configuring 2 GB of swap, configure 2.5 GB and start
    killing once 2 GB is in use. That improves your batch queue
    turnaround and reduces the time wasted on jobs that are too large.
  - watermarks are policy and belong entirely in userland;
    they are just too dependent on the usage scenario
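
    A rough userland sketch of such a watermark check, for illustration
    only. It parses text in the format of /proc/meminfo and reports
    whether swap usage has reached a configurable watermark. The function
    names, the SAMPLE text and the 2 GB default are assumptions taken from
    the example figures above; the sketch also assumes the `SwapTotal:'
    and `SwapFree:' fields are reported in kB.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB), parsed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name:  <number> ..." lines; ignore anything else.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def over_watermark(meminfo_text, watermark_kb=2 * 1024 * 1024):
    """True once swap usage reaches the (hypothetical) kill watermark."""
    return swap_used_kb(meminfo_text) >= watermark_kb


# Made-up sample: 2.5 GB of swap configured, a little over 2 GB in use.
SAMPLE = """\
MemTotal:       1048576 kB
SwapTotal:      2621440 kB
SwapFree:        419430 kB
"""
```

    A daemon could call over_watermark(open("/proc/meminfo").read())
    periodically and apply whatever site-specific policy it likes once
    the result turns true.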

Claus

-----------------
Footnote:

[1] On Rik's OOM killer algorithm:

Andrea Arcangeli has recently called for using a simple `allocation
rate' scheme. Starting from that approach, which seems best suited
to a general in-kernel default behaviour, some observations:

* Rate = Pages / Time

* Time can be physical (wall-clock, CPU) or virtual (a counter that
   is incremented in the VM handler, or per process)

* Time can be weighted: you could take all time since process creation,
   or put an emphasis on recent behaviour (i.e. `fade out' older events)

* Which of the multiple choices for `time' is best depends heavily
   on the usage scenario

* Too short a memory risks reinstating the current behaviour,
   i.e. the process that happens to request a page is punished for
   another process's misdeeds.

* It is very hard, from the few case reports we have, to come up with
   a good universal scheme for the `decay' of time. The scatter of
   usage scenarios probably makes it a bad idea to pick one
   decay-based scheme over another.

* Examples and counter-examples for particular choices of decay
   can probably be constructed but are of little practical value.
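
To make the two notions of `time' concrete, here is a small sketch,
not taken from any kernel code. lifetime_rate() divides pages by all
time since process creation; decayed_rate() fades out older events with
an exponential half-life. The function names, the event list and the
half-life parameter are all assumptions made for illustration, and the
timestamps may be physical or virtual.

```python
import math


def lifetime_rate(pages, elapsed):
    """Rate = Pages / Time, taken over the whole process lifetime."""
    return pages / max(elapsed, 1e-9)


def decayed_rate(events, now, half_life):
    """Allocation rate that fades out older events.

    events is a list of (timestamp, pages) pairs. The weight of each
    allocation halves every `half_life` time units.
    """
    decay = math.log(2) / half_life
    weighted = sum(pages * math.exp(-decay * (now - t)) for t, pages in events)
    # The exponential weight integrates to 1 / decay, so multiplying by
    # decay turns the weighted page count back into pages per time unit.
    return weighted * decay
```

With a short half-life a single recent burst dominates the result, which
is exactly the risk described above; with a very long one the result
approaches a simple lifetime average.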

Based on these points I favour a solution that uses the times that
are basic, intrinsic process properties in the OS: wall-clock
lifetime and CPU time consumed. Simple is better.

If you have more specific knowledge about your situation, you should
probably go to userspace.

Which heuristic or mixture of these times and VM size one uses to
compare processes is largely a matter of personal taste (i.e.
witch magic), but it seems that most reasonable combinations
of these times would in practice tend to pick the same processes
that Rik's scheme picks.
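
One possible mixture, purely for illustration: start from the VM size
and damp it by CPU time consumed and by wall-clock lifetime. The Proc
record, the function names and the choice of square-root damping are
assumptions made for this sketch; it is not a transcription of Rik's
code.

```python
import math
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    vm_pages: int        # virtual memory size, in pages
    cpu_seconds: float   # CPU time consumed
    wall_seconds: float  # wall-clock lifetime


def badness(p):
    """Higher score = better kill candidate.

    A large VM size raises the score; CPU time consumed and a long
    lifetime lower it, so old, hard-working processes are spared.
    """
    score = float(p.vm_pages)
    score /= math.sqrt(max(p.cpu_seconds, 1.0))
    score /= math.sqrt(math.sqrt(max(p.wall_seconds, 1.0)))
    return score


def pick_victim(procs):
    """Return the process with the highest badness score."""
    return max(procs, key=badness)
```

Swapping the roots for logarithms, or changing how strongly lifetime is
damped, changes the scores but will often leave the ranking of a fresh
memory hog against a long-running daemon unchanged.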

The added flavour in Rik's heuristics (favour root processes, favour
processes with hardware access) is a welcome use of easily
available information, or can at least be tolerated, depending on your
point of view. One probably shouldn't overdo it.

Perhaps a per-process bit (don't kill me), which could be set by
userspace through /proc, might be an easy and straightforward
addition to Rik's kill selector, and would allow a bit of policy to
be injected from userspace.
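
A sketch of how such a bit could enter the selection step. The /proc
interface is left out because no format is proposed here; the
`protected' flag below is a hypothetical stand-in for the bit, and the
dictionary layout and function name are likewise assumptions.

```python
def select_victim(candidates):
    """Pick the pid with the highest score among unprotected processes.

    candidates is a list of dicts with the keys 'pid', 'score' and,
    optionally, 'protected' (the hypothetical don't-kill-me bit).
    Returns None when every candidate is protected.
    """
    eligible = [c for c in candidates if not c.get("protected", False)]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["score"])["pid"]
```

The None case needs a decision in any real implementation: if everything
is marked `don't kill me', the kernel still has to survive, so the bit
could only ever be a hint.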

-- 
claus.fischer@intel.com   Intel Corporation SC12-205 ... not speaking
phone   +1-408-765-6808   2200 Mission College Blvd.           for Intel
fax     +1-408-765-9322   Santa Clara, CA 95052-8119
