Re: Some questions about linux kernel.

From: Jesse Pollard
Date: Tue Mar 21 2000 - 13:52:54 EST

--------- Received message begins Here ---------

> On Mon, 20 Mar 2000, Richard B. Johnson wrote:
> > Malloc(), as stated before, just sets a new break address when it
> > runs out of heap. It keeps track of the heap, but not very carefully.
> [ paging description snipped]
> > A problem occurs when there are no longer any free pages to steal.
> No. The problem we're discussing occurs when there's no more backing
> storage on the swap device(s). If there aren't any "stealable" page-frames,
> that's another matter. It means the whole RAM is 'locked' by some means.
> It's a deadlock situation, AFAIK, and SHOULD never happen. And I think
> it can happen only with a kernel bug.
> > When malloc() attempts to set a new break address, it sets up a
> > handler. Then it calls the kernel to set a new break address.
> > Malloc(), before accepting this address, could write a word of
> > zeros to the top allocation. This could cause a page-fault. If
> > the page-fault handler could not fault in a new page, it could
> > send a signal to the process (received by malloc()). Malloc
> > can then return NULL for the current allocation request. In this
> > manner, the caller of malloc() would always be assured that memory
> > was available.
> It has to touch all the pages. And, even if it's probably not so in
> the current implementation, I wonder if, once a page is paged-in,
> its space on the swap device is made available to other pages or not.
> It could be. In such a case, you can't be sure even if you touch all
> the pages at malloc time.
> > Unfortunately, this is naive. The first time the break address was
> > extended, this would work. However, what happens after the kernel
> > steals pages from your task to satisfy other requests? Eventually
> > pages that you thought you owned, have to be faulted in. There may
> > be no more pages to steal so you, thinking you have safely allocated
> > real pages, are now deadlocked --and dead.
> What do you mean with 'steal'? You mean paging out or in (and freeing
> swap space)? Page faults are not a problem here. It's paging-out
> when there's no more swap storage.
> > The only solution to an out-of-memory condition is to never run
> > out of memory. The place where all of the system information is
> > known is in "user space". The kernel readily "knows" stuff about the
> > current process, but retrieving information about other tasks in
> > a page-fault handler would result in an extremely poor performing
> > machine. A user-space daemon can acquire information about all the
> > tasks, can detect runaway tasks, can safeguard special tasks like
> > Web Servers that haven't gone crazy, and can watch for performance
> > hurting rogue programs.
> >
> > Such a program, if properly designed, is the solution to such
> > out-of-memory conditions.
> User or kernel land does not really matter. Such a program, either
> 'overcommits' or not. If it does, it has to choose deadlock or kill.
> If it does not, the kernel could do that just as well.
> >
> > Cheers,
> > Dick Johnson
> >
> > Penguin : Linux version 2.3.41 on an i686 machine (800.63 BogoMips).
> >
> >
> [What follows is an answer to many other messages posted on the
> matter, by many authors - basically my thoughts on the matter]
> In the whole thread, people keep misusing the words "OOM", "memory",
> "virtual" and so on. I agree that what you write is almost true. I'd like
> to make clear that "virtual RAM" is something that does not exist.
> There's RAM, call it just "memory". There's swap space, which is used
> to support paging (AFAIK, Linux does not do swapping). And there's
> a process virtual address space, which is represented as PTEs (and
> other structures) in kernel land. You never, strictly speaking, go OOM,
> if you do paging. The kernel already lets processes grow in size more than
> the available memory. During normal operation, if you see >0% swap
> utilization, you're OOM. But it is known that a process does not need
> all the memory it allocates at once. So, to increase system throughput
> (the number of processes you can run in a given time interval) and
> concurrency (strictly related), you use paging. You never go OOM here
> (infinite swap space assumed, of course), but you do get problems if the
> amount of memory *needed* (not just allocated) by all processes at a
> given time is bigger than the RAM you have: the mm system just keeps
> paging in and out, reducing memory accesses to I/O speed
> (and loading the I/O subsystem at the same time). That's what *I* call OOM.
> You get there slowly, and you're really in trouble when even the I/O
> subsystem can't keep up with it. The system is stalled.
> The whole thread (and related ones) IS NOT about this.
> Since swap space is not infinite, you may fill it up, and then you're OOS
> (out of swap). It's not OOM. The behaviour is completely different. With
> just a few hundred KB free (on the swap device), the system is all fine.
> A few seconds later, it's completely stalled, in such an unrecoverable way
> that (for many people) a process killer is the only solution.
> Actions that should be taken when OOS is detected is what people are
> discussing now.
> "Virtual memory" is a completely different, unrelated thing. The kernel
> provides a private virtual address space to processes. Memory addresses
> are "virtual", meaning that you write at address 0xff00, but only the
> kernel knows where the process is writing in RAM. This also means that
> you can run 10 copies of 'ls', each writing to address 0xff00 but actually
> writing to 10 different page-frames in RAM, the processes *not being aware*
> of it: that's why it's called "virtual". The whole thing is "virtual
> address space" not "virtual memory". You could have paging WITHOUT address
> translation, and processes could even handle that by themselves (think of
> "overlays"). And you can have VA WITHOUT paging. Of course, virtual
> addressing just integrates too well with a paging system, so we have both.
> The whole thing about "malloc / overcommitting" arose because in legacy
> UNIX implementations the kernel, *before* extending a process VA space,
> allocated enough backing storage on swap to *completely* swap it
> out ('swap' it, not 'page' it!). This is required for 'swapping'.
> It has been noticed that not all the processes will be swapped out at
> the same time. That is, the kernel does not use all the swap space
> it allocates at the same time. The same optimization you made for RAM,
> you can make for swap. Again, you increase system throughput. With
> paging, where only part of the process address space is on swap at a
> given time, it's even better.
> Both optimizations come at a price. Without paging, OOM situation is
> handled much better (in a way). And a process address space is always
> there (in RAM), as far as the process is concerned. With (just) swapping,
> this is true. Either the process is completely swapped-out (so being
> 'dead'), or its address space is completely valid. Yes, I can make many
> examples of how useful this can be. For every low-latency application,
> that needs to react in a timely manner to external events, paging is a
> bad thing. When its event loop calls the handle_extern_event_no_1090()
> function (that has not been recently called, so it may have been
> paged out) the application may sleep (completely unaware of it). Is
> paging bad?
> And, BTW, is it fair?

Good summation and review.

> My poor little process, using up just
> 100KB of address space, with a working set of 20KB, can't run smoothly
> because of that big memory hog (simulation) which has 800MB of address
> space, and a working set of about 62M, which is causing so much paging in
> and out on our 64MB system? Shouldn't the paging system see that by granting
> me just 2 more page-frames my performance would double, without
> the big simulation even noticing? You can make a worst-case example for
> almost every kernel design or implementation choice. Shouldn't 'ls -l'
> perform better, if we put file ownership and permission info in the
> directory entry, instead of in a separate structure?

Your "poor little process, using up just 100KB of address space, with a
working set of 20KB," is using two different, sort of overlapping quotas.

1. resident set quota - your "real memory" was set by management
   to what they considered reasonable for your application/job
2. virtual limit - they let you have sufficient virtual space to run your
   job, even if it was at a reduced throughput.

The "big memory hog (simulation) which has 800MB of address space, and
a working set of about 62M" was deemed a more critical (and possibly
time critical) process. For all I know it may have been a wind tunnel
simulation for aircraft design studies. This process may really have
been touching 90% of the 62M on each iteration, and walking through
the entire virtual space every two iterations. Would it have noticed
the loss of 2 pages? I don't know, but probably not.

The quotas were directed by the owners/trustees of the system. That is their
right, privilege, and responsibility. If you can justify expanding your
quotas, I'm sure they would have done so.

BTW - when resident set quotas are used, you are usually given a "minimum
guaranteed amount". If more is available, either because the large simulation
is not running, or other users just are not logged in, then the system
allows you to get more (up to the limit of the maximum virtual quota, if
that much is available). When the large simulation starts, then your
(and other users') processes are trimmed (as needed) until you reach the
"minimum guaranteed amount". This allows the memory hog to run with its
"minimum guaranteed amount".
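
A rough sketch of that rule in C, as an illustration only. The struct, the
function name and the idea of counting "spare" pages are assumptions made for
the example, not any real kernel interface; the numbers in the usage note
assume 4KB pages.

```c
#include <stddef.h>

/* Hypothetical per-job quota, counted in page-frames. */
struct rss_quota {
    size_t min_guaranteed; /* frames the job keeps even under pressure */
    size_t max_allowed;    /* upper bound, e.g. the maximum virtual quota */
};

/*
 * Frames a job may keep resident when `spare_pages` frames are not needed
 * to honour anybody's guaranteed minimum. With no spare frames the job is
 * trimmed to its minimum; with plenty it grows up to its maximum.
 */
size_t resident_allowance(struct rss_quota q, size_t spare_pages)
{
    size_t headroom;

    if (q.max_allowed <= q.min_guaranteed)
        return q.min_guaranteed;

    headroom = q.max_allowed - q.min_guaranteed;
    if (spare_pages > headroom)
        spare_pages = headroom;

    return q.min_guaranteed + spare_pages;
}
```

For the small process in the example (20KB working set, 100KB address space)
that would be a quota of {5, 25}: it may hold all 25 frames while the machine
is idle, and is trimmed back to 5 once the simulation claims the spare frames.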

Note: both processes were allowed to run. There were sufficient real resources
available. If there had been only enough for the simulation, and you started your
process, which process would have been aborted? The simulation, or yours?
Which was deemed more important for the organization at the time?

> malloc() is just a user space "memory" allocator. It's not a kernel
> interface. Whatever structures it uses, or whatever side effects it has,
> is beside the point. It uses brk() to extend a process VA space when
> needed, in order to handle more *addresses*, not more RAM. The kernel
> does not grant anything but a set of virtual addresses.
> In legacy OSes, as a *side effect* of the swap space pre-allocation,
> it turns out that those addresses are always valid, either in RAM or
> paged out. But I don't think this is implied in the brk() semantics.
> Experience has already shown that in a general purpose OS such as Linux
> swap space "overcommitting" is a win. In special cases, of course, it's
> not. Providing special, worst-case examples for actual behaviour
> won't make it disappear. Especially if the examples are life support systems
> or cruise missiles. Especially if you're using those examples to compare
> Linux and NT based on stability.

It is a win in a single user environment. It is a catastrophe waiting to
happen in a multiuser environment. I need to prevent the failures.
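
To make the "addresses, not RAM" point concrete, here is a minimal sketch in
C. The helper names are made up for the example. On an overcommitting kernel
a successful malloc() only hands back addresses; a page needs backing once it
is written to, so the only way to provoke the failure early is to touch every
page. As noted earlier in the thread, even that does not guarantee the pages
stay backed afterwards.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: write one byte at every page_size stride of buf,
 * plus the final byte, so that every page the buffer overlaps is written
 * to even when buf is not page-aligned. Returns the number of strided
 * writes, or 0 if page_size is 0.
 */
size_t touch_pages(volatile unsigned char *buf, size_t len, size_t page_size)
{
    size_t off, writes = 0;

    if (page_size == 0)
        return 0;

    for (off = 0; off < len; off += page_size) {
        buf[off] = 0;
        writes++;
    }
    if (len > 0)
        buf[len - 1] = 0;

    return writes;
}

/*
 * Hypothetical helper: allocate len bytes and touch them immediately.
 * Returns NULL only if malloc() itself refuses; if the kernel cannot back
 * a page during the touch, the process is at the mercy of whatever policy
 * the kernel applies at that moment.
 */
void *malloc_touched(size_t len, size_t page_size)
{
    unsigned char *p = malloc(len);

    if (p == NULL)
        return NULL;
    touch_pages(p, len, page_size);
    return p;
}
```

A caller would pass the real page size, for example from
sysconf(_SC_PAGESIZE), rather than a hard-coded value.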

> In pre 2.2 times, I've never seen a Linux box crash for OOS. Nor processes
> being randomly killed. The system just deadlocked. With X running, sometimes
> I managed to "recover" OOS just hitting CTRL-ALT-BACKSPACE and *after a
> few hours*, X exited bringing the memory hog with it, and I had my system
> back. I'm not saying it was good. It was a choice. An OOM (or better OOS)
> killer is another one.

I have, at least I believe that's what happened. I couldn't be there running
top at the instant it failed (it rebooted itself).

> I think your swap space should be dimensioned in a way that you hit
> the OOM situation before an OOS one. And you just want to cure OOM, the
> performance dropping so much in such a case. So OOS should never happen.
> Special cases do exist. Handle them in a special way.
> Don't use MMU on life-support systems (or missiles).
> Have your big simulation program handle its own (huge) backing store.

How is that enforced?
How does that simplify the difficulty of programming it in the first place?

> Fix buggy software that hogs memory.
> Disable malicious users' accounts, or set ridiculously low resource limits for
> them. And don't run critical software on general purpose systems.

Weather forecasting is considered critical. Yet the jobs are run on general purpose
systems because there is no such thing as a "weather computer". They are also
considered memory hogs.

> Use RT OSes when needed.
> Use resource limits and *limit the number of concurrent users* if you
> want strict control of them. If availability is your only concern, triple
> your budget.

There are no enforceable resource limits. It is cheaper to use resource controls
and be able to justify the "triple" budget.
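
For reference, the limit that does exist is per process, set through
getrlimit()/setrlimit(). The sketch below is an illustration with a made-up
helper name. It bounds the address space of the calling process (and is
inherited by its children), but it does not add up what one user consumes
across all of his processes, which is the kind of quota meant here.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: set the soft address-space limit of the calling
 * process to soft_bytes, clamped to the current hard limit because an
 * unprivileged process may not raise the soft limit above it.
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int cap_address_space(rlim_t soft_bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return -1;

    if (lim.rlim_max != RLIM_INFINITY && soft_bytes > lim.rlim_max)
        soft_bytes = lim.rlim_max;

    lim.rlim_cur = soft_bytes;
    return setrlimit(RLIMIT_AS, &lim);
}
```

Once the soft limit is in place, brk() and mmap() requests beyond it fail with
ENOMEM, so malloc() returns NULL instead of being granted more addresses.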

> If you want to make best use of available resources, just use Linux (and
> its overcommitting mm system).

Except when the use by multiple users collides without a way to resolve
the conflict other than aborting random user processes, or by crashing.

> What's wrong with that?

Nothing at all. I want to use Linux where it is most appropriate. I also see
the lack of the option to have quotas as artificially limiting Linux to
just single user workstations.

BTW - I liked your summation, even if I did have some quibbles with the
trailing parts.

Jesse I Pollard, II

Any opinions expressed are solely my own.

