Re: A true story of a crash.

Albert D. Cahalan (acahalan@cs.uml.edu)
Sat, 15 Aug 1998 20:00:09 -0400 (EDT)


Matt Agler writes:
> On Sat, 15 Aug 1998, Albert D. Cahalan wrote:

>>> It would be better and simpler to let the user or admin decide what to
>>> kill. Instead of killing a process, we should put it to sleep.
>>
>> End result: 100% memory use, 100% idle, all processes stopped.
>
> That depends on implementation. Of course, if you let every last page
> get used before doing anything, you're stuck.

How do you _not_ let every last page get used? The first obvious
problem is overcommit, which you'd have to disable. That makes the
amount of swap space you need go way up, and the system will have
to refuse memory allocations even when you have plenty of unused swap.
The second obvious problem is the inability to determine which processes
should be able to access the reserved memory. Remember that normal
user logins run as root before they change UID, and daemons can grow.
The third obvious problem is kernel memory usage. The kernel will grab
whatever memory it needs to satisfy interrupt handlers. What if X and
in.telnetd are stopped?
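
To make the first problem concrete, here is a toy model of allocation
without overcommit. It is only a sketch: the class, the page counts, and
the request pattern are made up for illustration and are not taken from
any kernel code. Each process asks for 100 pages but touches only 10.

```python
from dataclasses import dataclass


@dataclass
class StrictAccounting:
    """Toy model: every allocation is charged in full when requested."""
    total_pages: int      # RAM + swap, in pages (hypothetical figure)
    committed: int = 0    # pages promised to processes
    touched: int = 0      # pages actually backed by RAM or swap

    def allocate(self, pages: int) -> bool:
        # Without overcommit, a request is refused as soon as the promises
        # would exceed what the system could back, whatever the real usage.
        if self.committed + pages > self.total_pages:
            return False
        self.committed += pages
        return True

    def touch(self, pages: int) -> None:
        # Only touched pages consume real memory.
        self.touched += pages


def run(total_pages=1000, request=100, used=10, attempts=12):
    vm = StrictAccounting(total_pages)
    refused = 0
    for _ in range(attempts):
        if vm.allocate(request):
            vm.touch(used)
        else:
            refused += 1
    return vm.committed, vm.touched, refused


committed, touched, refused = run()
print(committed, touched, refused)  # prints "1000 100 2"
```

In this model the last two requests are refused while 900 of the 1000
pages have never been touched, which is the "refuse allocations with
plenty of unused swap" case.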

>>> If the machine has overextended itself, we're probably swapping like mad
>>> already. It's hammered. We're not getting anything done. We don't need
>>> efficiency anymore. We want recovery without losing in-process work.
>>
>> Not possible.
>
> Gee that was fast.

Yes, it was. Something _must_ be killed. The kernel _will_ make that
decision. The kernel can be dumb or really dumb. I prefer "dumb" over
the "really dumb" we have now.

>>> For example, let's put each process that asks for a page we can't
>>> give to sleep (from do_no_page?). This would be a special sleep in that
>>> it doesn't wake up until we return to a certain threshold of free memory.
>>> What would happen is that its pages would age and get thrown out.
>>
>> Thrown out? You must mean that literally, since there may be no more swap.
>> The process will be really messed up if you send pages to /dev/null.
>
> No, I was referring to text pages. They don't use swap. Sorry.

OK, you throw out all the text pages. The system thrashes madly (for days!)
until you have 0 text pages in memory. Now what?
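
A toy simulation of the sleep-instead-of-kill idea shows where this ends
up. It is a sketch under stated assumptions, not kernel behavior: there
is no swap left, processes only grow and never free or exit, and the
function name, page counts, and wake threshold are all invented here.

```python
def simulate(total_pages=100, wake_threshold=10, demands=(40, 40, 40), steps=50):
    """Toy model of 'sleep instead of kill' with no swap left.

    Each process faults in one anonymous page per step until it holds its
    demand. A fault that cannot be satisfied puts the process to sleep
    until free memory is back at wake_threshold.
    """
    free = total_pages
    held = [0] * len(demands)
    asleep = [False] * len(demands)

    for _ in range(steps):
        for i, demand in enumerate(demands):
            if asleep[i]:
                # A sleeper keeps its anonymous pages: with no swap there
                # is nowhere to put them, so free memory never recovers.
                if free < wake_threshold:
                    continue
                asleep[i] = False
            if held[i] >= demand:
                continue  # this process has everything it asked for
            if free == 0:
                asleep[i] = True  # page fault with nothing to give
            else:
                free -= 1
                held[i] += 1

    return free, held, asleep


free, held, asleep = simulate()
print(free, held, asleep)  # prints "0 [34, 33, 33] [True, True, True]"
```

Three processes wanting 120 pages in total on a 100-page machine all end
up asleep with zero pages free, and nothing in the model can ever wake
them: 100% memory use, 100% idle.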

> option besides SIGBUS. If a production machine's kernel suddenly kills
> off your main app because you mistakenly underestimated resources, you may
> want it to wait until you have a chance to give it more resources
> instead of just killing it off and making you start over. ext2fs reserves
> a bit of space for the admin. Perhaps a bit of swap should be reserved
> also. The admin now has _MORE_ options.

Option 1: reset button
Option 2: power switch
Option 3: power cord

Until a human admin decides, the system sits idle and useless.
Imagine that on a remote Linux machine 100 miles away.
