Re: [discussion] Swap overcommitment recovery

Rik van Riel (H.H.vanRiel@phys.uu.nl)
Tue, 18 Aug 1998 19:21:53 +0200 (CEST)


On Mon, 17 Aug 1998, Matt Agler wrote:

> Here's a summary of my idea. I'd really be interested in any additional
> comments you may have.
>
> Problem: When Linux runs out of swap, kswapd uses all CPU, essentially
> hanging the system.
> Solution: Detect when swap is about to be exhausted and recover the
> machine by reducing CPU and disk usage to noncritical levels and then
> running a recovery procedure in userspace.

Sounds reasonable :)

> 1) Add a flag to the kernel to indicate that we are in recovery mode.
>
> 2) Modify the swap subsystem to set the flag when we reach a configurable
> threshold (e.g. 90% utilization). Reset the flag when we drop back below
> this threshold.
> 2.1) When the flag is set, signal init to run the nomemory action for the
> current runlevel.

This won't work when the system is thrashing itself to death.
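
For concreteness, a minimal sketch of what steps 1, 2 and 2.1 would
amount to. This is purely illustrative; it is not code from my tree or
from the stock kernel, and every name in it is invented:

/*
 * Illustrative sketch of steps 1, 2 and 2.1; not kernel code.
 * recovery_mode, threshold_percent, notify_init() and
 * check_swap_threshold() are all made up for the example.
 */
#include <stdio.h>

int recovery_mode;                        /* step 1: the recovery flag   */
static const int threshold_percent = 90;  /* step 2: configurable cutoff */

/* Stand-in for "signal init to run the nomemory action" (step 2.1). */
static void notify_init(void)
{
        printf("would signal init to run the nomemory action\n");
}

/* Call this whenever swap usage changes; sizes are in pages. */
void check_swap_threshold(long swap_total, long swap_used)
{
        if (100 * swap_used >= (long)threshold_percent * swap_total) {
                if (!recovery_mode) {
                        recovery_mode = 1;
                        notify_init();
                }
        } else {
                recovery_mode = 0;      /* back below the cutoff */
        }
}

By the time a check like this trips, the box can already be thrashing
so hard that init never gets enough CPU to run the nomemory action.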

> 3) Modify the vm subsystem to check this flag. If it is set, any non-root
> process that attempts to use more swap by faulting on a nonexistent but
> legal page (COW) is put to sleep.

This might reduce the thrashing a little, but mmap(), the swap cache
and in-kernel allocations (NFS, TCP) blur the picture so much that
there's not much left of this idea...

> 4) Modify the scheduler to not wakeup these specially sleeping processes
> until the flag is reset.
>
> 5) Modify init to allow for a nomemory action.

While init might start up the nomem thingy, the system might already
be so slow that it won't be workable anymore. The picture of how much
VM is really in use is more complex than most people think.
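
To make steps 3 and 4 concrete, here is a toy model of the idea; again
purely illustrative, with all names invented except recovery_mode from
the sketch above:

/*
 * Toy model of steps 3 and 4; not kernel code.  A non-root task that
 * faults in a new page while recovery_mode is set gets parked, and
 * the scheduler skips parked tasks until the flag clears.
 */
extern int recovery_mode;       /* the flag from the earlier sketch */

struct task {
        int uid;                /* 0 == root                        */
        int parked;             /* set while held back by step 3    */
};

/* Step 3: called when a task faults in a new (COW/anonymous) page. */
void on_new_page_fault(struct task *t)
{
        if (recovery_mode && t->uid != 0)
                t->parked = 1;  /* sleep until recovery mode ends   */
}

/* Step 4: the scheduler refuses to pick parked tasks while the flag
 * is still set. */
int is_runnable(const struct task *t)
{
        return !(t->parked && recovery_mode);
}

Even in this toy form the catch from above is visible: only anonymous
and COW faults are caught, while mmap(), the swap cache and in-kernel
allocations (NFS, TCP) keep eating memory unchecked.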

My killing code has been ready for over 3 months; the real problem
is deciding when to start it. Things like cache, swap cache, buffers,
NFS/TCP buffers, and swap usage (swap-cached stuff -- is it used or
free?) blur the picture so much that there's no real way to test for
OOM situations.

Besides, when the system is running happily with all of swap filled
up, there's no reason to start killing yet... Imagine the situation
where all the swap is used, but 1/4th of swap space is also swap
cached and half of memory is used by buffer+cache. In most tests
proposed in these threads, some kind of killing code would have
kicked in by now -- completely unneeded and with possibly
disastrous results.
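
To put made-up numbers on that scenario (the sizes below are invented;
only the proportions come from the example above):

/* Made-up sizes for the scenario above, in MB, showing the gap
 * between what a naive "out of swap" test sees and what is really
 * reclaimable.  Ordinary C program; compile and run as-is. */
#include <stdio.h>

int main(void)
{
        long ram_free     = 2;  /* what "free memory" reports       */
        long swap_free    = 0;  /* all of swap is allocated         */
        long swap_cached  = 16; /* 1/4 of swap also sits in RAM     */
        long buffer_cache = 32; /* half of RAM is buffers + cache   */

        /* A naive OOM test only sees this: */
        printf("naive view:     %ld MB left\n", ram_free + swap_free);

        /* But swap-cached pages are duplicated in RAM, and buffers
         * and cache can be shrunk, so the real headroom is more like: */
        printf("real headroom: ~%ld MB\n",
               ram_free + swap_cached + buffer_cache);
        return 0;
}

A test that only looks at "swap free == 0" would have started killing
on this box, even though there is still plenty to reclaim without
killing anything.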

> Precedent:
>
> 1) ext2fs reserves space for root, so why not reserve swap for root.

Because memory is much more 'volatile' and dynamic than FS space.
All the different kinds of caches (some 6 to 10 types!) and other
stuff complicate things beyond recognition...

> 2) init has actions for critical states like powerfail and ctrlaltdel
> (critical because root wants us to die), so why not a nomemory action.

Because when we're really out of memory we can't start the nomem
script. And when we're not out of memory yet, we shouldn't start
killing stuff.

> Benefits:
>
> 1) The kernel can avoid the consequences of optimistic memory allocation
> by deferring the hard decisions to userspace.

This doesn't mean that the hard decisions won't have to be
made. An in-kernel solution has the advantage of reliability
and stricter code review. A userspace solution will most
likely result in a lot of nonfunctional or buggy implementations.

This is mostly because of the incredible complexity of the MM
system and the fact that the people who know that complexity
usually don't hang out in userspace :)

> 2) The kernel will never swap to death. At worst it will hang in a state
> where root can log in and do stuff without losing any in-progress work. At
> best the nomemory action of init will clean up whatever has gone wrong and
> the machine can continue unattended.

No way. Even if VM isn't exhausted yet, the system _can_ already
be swapping itself to death. There's no guarantee that saving
the last 2 megs will allow a userspace solution to actually get
something done in a useful timespan.

> 3) All policy is left to userspace. Processes aren't killed unless
> userspace decides to kill them. Everything is fully configurable.

Configurability can be a nice point. I believe, however, that the
complexity of the MM system combined with the ignorance most people
have about the system will result in worse configurations than a
standard in-kernel configuration that has been reviewed by all the
VM gurus. This is mostly because some parts of the VM system behave
somewhat counterintuitively.

> 4) Processes that are not allocating memory are not penalized by the
> kernel. (although an indiscriminate nomemory action script may do so -
> but that's another problem)

This is nicely addressed by looking at the CPU time consumed and
the time the process has been running. What about your long-running
simulation that needs just a few pages to save its results? Wouldn't
it be better to kill off some new process that hasn't done much
useful work yet anyway?

> Notes:
>
> 1) Runaway root processes can still hang the system - but so what?

We don't want the system to hang, root or no root.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
