2.6.31 and OOM killer = bug?

From: Anton Starikov
Date: Sun Feb 14 2010 - 18:43:22 EST


Hi,

The setup:
A 16-core Opteron node, diskless with NFS root, swapless, 64GB of RAM, running OpenSUSE 11.2 with kernel version 2.6.31. Although the kernel isn't vanilla, I think LKML is probably the right place to report this.

The problem:
On this node a user runs an MPI job with 16 processes, a local job using shared-memory communication.
At some point these processes try to use more memory than is available.
Normally, all or some of them would be killed by the OOM killer, and that has worked for years across many kernel versions.

Now, with a fresh setup, I got something new. The OOM killer tried to kill them but didn't succeed, and worse, it left the system in an unusable state. All those processes are locked and unkillable. Some other processes are also locked and unkillable/inaccessible. kswapd consumes 100% CPU (which I think is expected behavior when there is no free memory).
There is obviously no free memory, because all the original processes are still in memory.
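
For anyone trying to characterize the same symptom, here is a minimal sketch of how the stuck processes could be listed. It assumes the locked processes sit in uninterruptible sleep (state D in /proc/<pid>/stat), which is only a guess and is not confirmed by the report above. The function names are hypothetical and Python is used purely for illustration.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for processes whose state field is 'D'.

    Assumption: the un-killable processes show up in state D.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)

if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

If the MPI ranks and the other inaccessible processes all appear in that list, it would suggest they are blocked inside the kernel rather than ignoring the kill signal.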

I tried to test the OOM behavior, and it now happens like that every time.

I attach the full gzipped log of all related information captured by the logserver (sent by logserver and netconsole, so it may be partly duplicated). Sorry that it is so big, but I didn't know which information would be important.

Anton.



Attachment: fixedlog.txt.gz
Description: GNU Zip compressed data