Virtual vs. physical swap & shared memory forks (sprocX)

From: Linda Walsh (law@sgi.com)
Date: Sat Mar 25 2000 - 03:05:39 EST


David Whysong wrote:
> Non-deterministic with the current kernels that randomly kill things, yes.
> I certainly don't like the situation. But "fixing" the problem by adding
> new system calls isn't a good solution -- you've redefined the problem
> such that all current software is broken, and needs to be rewritten to use
> your syscall.

---
	All?  That's a pretty strong statement.   Programs that currently
behave will continue to run.  Programs that spawn off hundreds of 40 megabyte
processes are being careless.  They are relying on the non-deterministic
operating system not enforcing memory restrictions.  

There is an additional facility -- at the administrator's option, they can add "vswap" (also from IRIX), a "virtual swap space":

A file that the system considers to be a certain size (e.g. 400MB) but actually occupies no disk space. This is useful because many programs request much more swap space than they really need in order to run, and tie up the real swap space unnecessarily. When you add virtual swap space, the system lets you start applications even when they request more swap space than is actually available. In most cases this is fine, because there is enough real swap space for them to run.

This way you can allow overcommitment if you run applications that need it, but mission-critical applications can be run on a system with only 'physical' swap.

This would require no reprogramming of bad apps, but an admin would have to explicitly enable some amount of virtual swap space.
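To make the sparse case concrete, here's a toy sketch (not from IRIX -- the sizes are made up to match the 400MB example above): the program reserves far more address space than it will ever touch, which is exactly the case virtual swap is aimed at.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            size_t reserved = 400UL * 1024 * 1024;  /* ask for 400MB up front      */
            size_t used     =  10UL * 1024 * 1024;  /* ...but only ever touch 10MB */

            char *table = malloc(reserved);
            if (table == NULL) {
                    /* With strict (physical-only) accounting this can fail even
                     * though only 10MB of backing store would ever be needed;
                     * virtual swap lets the reservation succeed.                  */
                    fprintf(stderr, "could not reserve table\n");
                    return 1;
            }

            memset(table, 0, used);   /* only these pages need real backing store */
            printf("reserved %luMB, touched %luMB\n",
                   (unsigned long)(reserved >> 20), (unsigned long)(used >> 20));
            free(table);
            return 0;
    }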

The "-v" option for swap (IRIX) says:

-v vlength  Set the virtual length of the swap area to vlength. Normally, if this field is not specified, the amount of logical swap is increased by the size of the physical swap device being added. This option tells the system to increase the logical swap amount by vlength. Thus, the difference between vlength and the actual size of the device is the amount of virtual swap that is added. The virtual length must be greater than or equal to the physical length of the swap area. If not specified, the virtual length is set equal to the actual length. See swapctl(2) for a discussion on virtual swap spaces. In general, this option should only be used when there is no other way to get enough swap resources (even via NFS) and it is understood the potential problems creating virtual swap can cause. See the discussion of Logical Swap Space below.

... and from the virtual swap discussion:

Programs that have large address spaces and large programs that fork, may receive EAGAIN along with the "out of logical swap space" message on the console. This can also happen when debugging a large program with dbx or other debugger. There are two ways to avoid this error: adding more real swap space, or adding virtual swap space. Adding real swap space means allocating an additional disk partition or a regular file (either local or remote via NFS) to be used as a swap device (using the -a option shown above and the examples below). This is the required approach for programs that use most of the virtual addresses they allocate. The advantage of this approach is that it continues to avoid memory deadlocks, but requires physical disk space to be allocated.

The alternative is to add virtual swap space using the -v option. This increases the amount of logical swap space without using any physical disk space. This is suitable when the programs involved do not intend to use the virtual address space they allocate (i.e., when the address space is sparse or when a large program that forks intends to exec soon afterwards without modifying many pages). In these cases, physical swap space is not required and so adding virtual swap space allows the kernel to complete the logical swap space reservation and avoid the EAGAIN errors. The advantage of this approach is that it does not require any disk space, but adds the risk of encountering a memory deadlock. Memory deadlocks occur when the system has over-committed logical swap space by allowing the total private virtual space of the processes on the system to exceed real swap space. When processes attempt to use the allocated virtual space, the kernel has no place to store the data (since virtual swap space has no associated disk space), and a memory deadlock results. In these instances, the kernel kills one or more processes to free up enough logical swap space to break the deadlock. For this reason, virtual swap space should not be used in cases where the program will attempt to use the memory. For example, programs that expect malloc(3C) to return NULL when there is no more memory will in fact be allocated virtual memory that they could not use without causing a memory deadlock. The -v option should therefore be used with care.

> A better solution is to impose sane, deterministic behavior in the
> overcommitted case. This can be done with optional memory quotas in
> conjunction with Rik van Riel's kernel patch. But removing overcommit
> doesn't solve anything.
---
	Sure it does. If you run out of memory, then 'malloc' will return
NULL. (Yeah, I'm changing my story on the fly -- default to returning failure
unless vswap is used... then we can have the above.)
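A quick way to see the difference (a sketch -- where it dies depends on how much real memory and swap the box has): with strict accounting the malloc() itself returns NULL and the program exits cleanly; with overcommit the malloc() succeeds and the failure is deferred to the memset(), where the kernel suddenly has nothing to back the pages with.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            size_t chunk = 64UL * 1024 * 1024;   /* grab memory 64MB at a time */
            int i;

            for (i = 0; ; i++) {
                    char *p = malloc(chunk);
                    if (p == NULL) {
                            /* Deterministic case: the app sees the failure and
                             * gets to decide what to do about it.              */
                            printf("malloc failed after %d chunks\n", i);
                            return 0;
                    }
                    /* Overcommitted case: the pages are only really allocated
                     * when touched, so this is where the trouble starts.       */
                    memset(p, 1, chunk);
            }
    }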

> > First, the kernel should reserve some amount of memory so it will
> > never run out of memory.
>
> ...and that's hard to do. AFAIK Linux reserves a fraction of memory for
> the kernel (256 pages on my machine), but doesn't guarantee anything
> beyond that.
---
	So it dynamically reserves 256 more pages than it is currently using,
so it will be 256 pages away from being out of memory when it realizes there
is a problem? Then the kernel running out of memory shouldn't ever happen --
it should always have a 256-page buffer beyond what it is currently using.

> > Ideally, there should be two limits. One level would require processes
> > have UID==0 (or some CAP - CAP_USE_RESERVE_SPACE) to alloc beyond, a
> > second the kernel reserves for itself. If all processes become blocked
> > on waiting for memory, the kernel starts killing user-level processes
> > with the largest first. Probably another CAP for CAP_DONT_KILL_FOR_MEM
> > to protect system processes executing in user space.
---
	Note, I'm serious about the above CAPs. Again -- if you want to
protect your 'X', you can make sure it runs with the "don't kill" cap.
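For illustration only -- neither CAP exists today -- here is a toy user-space sketch of the selection policy I'm describing, with a plain dont_kill flag standing in for the proposed CAP_DONT_KILL_FOR_MEM:

    #include <stddef.h>
    #include <stdio.h>

    struct proc {
            int   pid;
            long  vsize_kb;    /* private virtual size of the process          */
            int   dont_kill;   /* stand-in for the proposed CAP_DONT_KILL_FOR_MEM */
    };

    /* Pick the largest user process that is not marked "don't kill". */
    static const struct proc *pick_victim(const struct proc *tab, size_t n)
    {
            const struct proc *victim = NULL;
            size_t i;

            for (i = 0; i < n; i++) {
                    if (tab[i].dont_kill)
                            continue;   /* admin-protected, e.g. the X server */
                    if (victim == NULL || tab[i].vsize_kb > victim->vsize_kb)
                            victim = &tab[i];
            }
            return victim;
    }

    int main(void)
    {
            struct proc tab[] = {
                    { 101, 80000, 1 },  /* X server, protected          */
                    { 202, 40000, 0 },  /* big simulation, fair game    */
                    { 303,  2000, 0 },  /* small shell                  */
            };
            const struct proc *v = pick_victim(tab, sizeof(tab) / sizeof(tab[0]));

            if (v != NULL)
                    printf("would kill pid %d (%ldkB)\n", v->pid, v->vsize_kb);
            return 0;
    }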

> Killing from largest to smallest isn't a good idea. That often makes the X
> server go first. Have a look at Rik van Riel's OOM killer patch for a
> better example. I think that the policy of what process to kill should be
> configurable.
---
	Can you describe its behavior? I don't happen to have a copy, but if
it's a good algorithm, it should be deterministic and well documented.

> I don't see how this solves anything. We already have vfork(),
---
	Vfork? You mean this one (from the Linux manpage):

    BUGS
        Under Linux, vfork is merely an alias for fork.

---
	That's not very useful.

> unfortunately. And I'm not enough of a kernel hacker to see the difference
> between sproc() and vfork(). Heck, my manpages don't even describe the
> difference between fork() and vfork()...
---
	The ones included with SuSE and RedHat state the above. Mandrake's
states:

    BUGS
        It is rather unfortunate that Linux revived this spectre from the
        past. The BSD manpage states: "This system call will be eliminated
        when proper system sharing mechanisms are implemented. Users should
        not depend on the memory sharing semantics of vfork as it will, in
        that case, be made synonymous to fork." Formally speaking, the
        standard description given above does not allow one to use vfork()
        since a following exec might fail, and then what happens is
        undefined.
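For what it's worth, the pattern vfork exists to optimize is the plain fork-then-exec, where copying a huge address space is pointless because the child throws it away immediately. A minimal sketch (on a real vfork, only _exit and exec are safe in the child):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            pid_t pid = vfork();   /* a real vfork skips copying the parent's pages */

            if (pid == -1) {
                    perror("vfork");   /* e.g. EAGAIN when swap is exhausted */
                    return 1;
            }
            if (pid == 0) {
                    execlp("ls", "ls", "-l", (char *)NULL);
                    _exit(127);        /* only _exit is safe if the exec fails */
            }

            waitpid(pid, NULL, 0);
            return 0;
    }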

> Again, this isn't very meaningful. Any non-deterministic behavior isn't a
> result of overcommitment, it's due to the fact that the kernel hasn't been
> informed of what to do when OOM. That can be fixed without removing memory
> overcommitment. Just implement quotas, or alternately task priorities and
> have the kernel kill the lowest priority tasks first. After all, by the
> time you start killing tasks on an overcommitted system, you would have
> been killing tasks long before without overcommit...
---
	No -- you'd return failures on malloc (assuming no virtual swap), not
random or senseless killing. Then each app can choose what to do when it runs
into an out-of-memory condition, instead of expecting that the sysadmin will
know the correct behavior for every app running on the system.

> The problem is not overcommit. The problem is that the system doesn't
> handle OOM well. It would be better to solve the problem than cover it up
> under some new system call.
---
	The system would handle it just fine if you returned NULL on mallocs
or ENOMEM/EAGAIN on forks. So which would you want: 1) when I'm in vi and
attempt to spawn a shell, it returns "insufficient memory", or 2) the system
starts deciding, by some sysadmin-set policy, what to kill first?

The user of such a system wouldn't know what to expect.
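To be concrete about option (1), this is all it takes on the application side -- a sketch, with a made-up error message, of how an editor could handle a failed shell spawn deterministically:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Try to spawn an interactive shell; report failure instead of dying. */
    static int spawn_shell(void)
    {
            pid_t pid = fork();

            if (pid == -1) {
                    /* With strict accounting an over-committed fork fails right
                     * here (EAGAIN/ENOMEM) and the editor just keeps running.  */
                    fprintf(stderr, "insufficient memory: %s\n", strerror(errno));
                    return -1;
            }
            if (pid == 0) {
                    execl("/bin/sh", "sh", "-i", (char *)NULL);
                    _exit(127);
            }
            waitpid(pid, NULL, 0);
            return 0;
    }

    int main(void)
    {
            return spawn_shell() == 0 ? 0 : 1;
    }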

-l

--
Linda A Walsh | Trust Technology, Core Linux, SGI
law@sgi.com   | Voice: (650) 933-5338



