Re: Memory hogs

Claus Fischer (claus.fischer@microworld.com)
Sat, 17 Jul 1999 12:27:00 -0700 (PDT)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Ben Greear: "NFS problem/question."
Previous message: Donald Harter: "2.2.5 kernel build script problem"

On Sat, 17 Jul 1999, Andrea Arcangeli wrote:

: So far I fixed all bugs so that now the machine doesn't die and init
: survive.
:
: Grab the patches here:
:
: ftp://e-mind.com/pub/andrea/kernel-patches/2.2.10/oom-2.2.10-G
: ftp://e-mind.com/pub/andrea/kernel-patches/2.2.10/swapversion2-2.2.10-A
: ftp://ftp.suse.com/pub/people/andrea/kernel-patches/2.2.10/oom-2.2.10-G
: ftp://ftp.suse.com/pub/people/andrea/kernel-patches/2.2.10/swapversion2-2.2.10-A
:
: With these two patches applyed on 2.2.10 you shouldn't be able to deadlock
: the machine and init anymore. You can still overload it to make impossible
: to login but a SYSRQ+I should always be able to restore the system.

Thanks Andrea, I value that work very much.
You are doing a very thorough workover of all code parts relevant
to OOM situations.

I just see the danger that people think it's enough to stop there and have
the rest handled by userspace, where there is no solution anywhere in
sight, and where the kernel emergency behaviour is still not what it
should and could be.

***

I can do some testing of these patches at home. At work, however, where
the problem originally occurred, I cannot develop and test the Linux
kernel. Testing your patches involves:

o Backing up my whole system (it's really my development workstation)
o Notifying all users to remove from my system
o Waiting until all their batch jobs have completed
o Turning off most of the swap
o Rebooting and testing

I know I shouldn't complain about that; if I want to see a new feature
I'd better test it myself. But the point is I HAVE indeed tested it
a year or so ago when Rik made his patches; it's just very tedious for
me to patch the kernel everytime I upgrade my system, so I've switched
over to using 2GB swap and having users take care before they run
possibly large jobs.

That solves my problem but it's clearly not a solution that is acceptable
for a real batch queue system where you would continuously run large jobs.
And any requirement for manual interaction is not acceptable in this
situation, and frankly not even desirable for the personal workstation.
If I have to re-init the system I can as well tell users not to run
any jobs on my desktop machine.

***

A good userspace solution cannot do any better in my situation than
killing the bad job. And even the best userspace solution might leave
loopholes or might not catch the situation in time.

In this case it's again important that the kernel's last-resort
mechanism is somewhat better than going crazy or waiting for
manual interaction.

And I can't think of any situation where killing the really bad job
would make the situation any worse than it already is.

Claus

-- Claus Fischer (claus.fischer@microworld.com)

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/

Next message: Ben Greear: "NFS problem/question."
Previous message: Donald Harter: "2.2.5 kernel build script problem"