On Fri, 14 Aug 1998, Ian and Iris wrote:
> A true story:
>
> The time is 12:05 pm CST. The date is now. You are merrily using your personal
> Linux 2.1.115 system, testing Communicator 4.5 PR1, when all of a sudden and
> out of the blue, the hard drive starts cranking ever harder. Xload scales down a
> few times as the load average goes ballistic. Quickly the machine grinds to a
> halt. The mouse won't move - you can't even change virtual consoles. Still the
> hard drive thrashes. You remember that you compiled the Magic SysRq key in, so
> in desperation, you try it. Alt-SysRq-K. There. You won't be able to use the
> console until you reboot (notwithstanding various uncouth dosemu tricks) but at
> least the system has stopped thrashing.
>
> It's happened before, and every time, you became angry, but at the time, had no
> proof of what the problem was. Linux, you figured, had about as much chance of
> crashing as, say, a mountain. You were wrong. Dead wrong.
>
> Fortunately, you had a window of "top" running. Curiously, you notice that
> kswapd was on top, with 100% CPU. Then it hits you. A great big ZERO under the
> "free" column for your swap space. You were out of swap.
>
> Alt-SysRq-s-s-s-u-u-b. The machine reboots.
>
> A thought strikes you. You strike back. Then you realize what it was. The machine
> Capital *MUST* have a way of coping when it runs out of memory. The machine did
> NOT cope.
>
> You run rampantly through the Kernel Source, looking for the pointers to the
> maintainers. Searching on "mem" and "mm" you find nothing - a few e-mail
> addresses match "mm" but that's all. You try "swap" but no luck there either.
> Perhaps there is no support for the memory management subsystem? But it keeps
> getting updates and patches. There must be SOMEONE working on this. To no avail
> you search and search.
>
> Exhausted, you decide to post your story to a few places. Cautiously you begin
> to consider the implications of kernel-hacking. Many times have you looked in
> awe and wonder at the depth of the source, but never has your hacking hand
> strayed from the safe world of userland applications. Daunted, you begin to
> consider the alternatives. Would it be better to try to monitor the free space,
> and compensate? How to compensate? Should one add more swap buffers on demand?
> This would be tricky - and what if the program got swapped out? Should you look
> for big processes and kill them? What if the problem was many small processes?
> Perhaps the most hungry user gets a SIGTERM, then later on a SIGKILL? You
> quickly decide that root should be exempted. You remember mlock(), and then you
> remember that you've never even tried it. To hang such an important decision on
> a program which may not even ever get to run seems precarious, at best.
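
On the "monitor the free space" idea: a minimal userland sketch of the check
itself, assuming /proc/meminfo carries "SwapTotal:" and "SwapFree:" lines in
kB (true of current kernels; verify on yours). The function name is made up
for illustration. It takes the file contents as a string, so the caller
decides how and how often to read the file. It does nothing about the concern
raised above, that the monitor itself may never get scheduled.

```python
from typing import Optional


def swap_free_fraction(meminfo_text: str) -> Optional[float]:
    """Return SwapFree/SwapTotal as a fraction, or None if it cannot be determined.

    Assumes lines of the form "Key:   12345 kB"; other lines are ignored.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal")
    free = fields.get("SwapFree")
    if not total or free is None:
        # No swap configured, or the expected lines are missing.
        return None
    return free / total
```

A caller would pass in the text of /proc/meminfo and act when the result drops
below some threshold of its own choosing.
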
>
> You decide that the WAY must be to patch kswapd, so it knows when its mission is
> futile, and invoke a more aggressive procedure.
>
> The machine must stay up!
>
> After some thought, you consider that fork-bombs are nowhere near as common on a
> relatively well-behaved "Personal" system as is running out of memory. Thus, it
> makes sense to kill the largest process not owned by root unless there are no
> more, then the largest process owned by root as long as it's not init, then just
> give up on the theory that if init wants to take down the system there are
> other, larger problems.
>
> Why largest? It's probably the out-of-control one.
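
The selection order described above is easy to state precisely. Here is a
sketch in userland Python, not kernel code. The Proc record and the choice of
resident size as the measure of "largest" are my assumptions; the proposal
does not say which size it means.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Proc:
    pid: int
    uid: int
    rss_kb: int  # assumption: "largest" means largest resident size


def pick_victim(procs: Iterable[Proc]) -> Optional[Proc]:
    """Largest non-root process; failing that, largest root process other
    than init (pid 1); failing that, None (give up)."""
    candidates = [p for p in procs if p.pid != 1]      # init is never chosen
    non_root = [p for p in candidates if p.uid != 0]
    pool = non_root or candidates                      # root only as a last resort
    if not pool:
        return None
    return max(pool, key=lambda p: p.rss_kb)
```

Note that under this rule a small user process is killed before a huge root
process, which is what the text asks for but may not be what one wants.
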
>
> What would users think if their process suddenly disappeared? They should be
> given a warning on the associated TTY that the process was killed due to lack of
> memory, along with a brief summary of what the process was. If there is no
> associated TTY, then any tty owned by that user would do. If that fails, e.g. if
> the user is not logged in, then oh well. Also, any time this happens, it would
> be wise to note the occurrence in the syslog, along with some details such as who
> owned the process, how big it was, and what the command line was. This would
> enable log file analysis to discover frequent offenders.
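
For the syslog part, a sketch of what such a record could look like, using
Python's standard syslog module. The field names and message layout are
invented here for illustration. A fixed key=value layout is what makes the
later log analysis easy. The TTY warning is not covered.

```python
import syslog


def format_oom_record(pid: int, uid: int, rss_kb: int, cmdline: str) -> str:
    """Build one line naming the owner, the size and the command line."""
    return f"out of memory: killed pid={pid} uid={uid} rss_kb={rss_kb} cmd={cmdline!r}"


def log_oom_kill(pid: int, uid: int, rss_kb: int, cmdline: str) -> None:
    """Send the record to the system log at warning priority."""
    syslog.syslog(syslog.LOG_WARNING, format_oom_record(pid, uid, rss_kb, cmdline))
```
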
>
> For politeness, it's probably wise to send a SIGTERM, wait one second, and then
> send a SIGKILL if necessary. During the second, new processes and memory
> allocations would fail because memory is full, but this already has an
> error-handling infrastructure behind it.
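
The SIGTERM, wait, SIGKILL sequence as a userland sketch. The function name
and the polling interval are my own choices. One caveat: a dead but unreaped
child of the caller still answers signal 0, so for one's own children this
may report an escalation even though SIGTERM did the job.

```python
import os
import signal
import time


def terminate_politely(pid: int, grace: float = 1.0, poll: float = 0.05) -> bool:
    """Send SIGTERM, wait up to `grace` seconds, then send SIGKILL.

    Returns True if SIGKILL was sent, False if the process was already
    gone or disappeared during the grace period.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence
        except ProcessLookupError:
            return False
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False
    return True
```
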
>
> Now to find some way of getting this done.
>
> Suggestions are welcome.
>
> Ian
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html
>