Re: cgroup OOM killer loop causes system to lockup (possible fixincluded)

From: Cal Leeming [Simplicity Media Ltd]
Date: Mon May 30 2011 - 17:36:26 EST


FYI everyone, I found a bug within openssh-server which caused this problem.

I've patched and submitted to the openssh list.

You can find details of this by googling for:
"port-linux.c bug with oom_adjust_restore() - causes real bad oom_adj - which can cause DoS conditions"

It's extremely strange.. :S

Cal

On 30/05/2011 18:36, Cal Leeming [Simplicity Media Ltd] wrote:
Here is an strace of the SSH process (which is somehow inheriting the -17 oom_adj on all forked user instances)

(broken server - with bnx2 module loaded)
[pid 2200] [ 7f13a09c9cb0] open("/proc/self/oom_adj", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
[pid 2120] [ 7f13a09c9f00] write(7, "\0\0\2\240\n\n\n\nPort 22\n\n\n\nProtocol 2\n\nH"..., 680 <unfinished ...>
[pid 2200] [ 7f13a09c9cb0] <... open resumed> ) = 9
[pid 2120] [ 7f13a09c9f00] <... write resumed> ) = 680
[pid 2120] [ 7f13a09c9e40] close(7 <unfinished ...>
[pid 2200] [ 7f13a09c9844] fstat(9, <unfinished ...>
[pid 2120] [ 7f13a09c9e40] <... close resumed> ) = 0
[pid 2200] [ 7f13a09c9844] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 2120] [ 7f13a09c9e40] close(8 <unfinished ...>
[pid 2200] [ 7f13a09d2a2a] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 2120] [ 7f13a09c9e40] <... close resumed> ) = 0
[pid 2200] [ 7f13a09d2a2a] <... mmap resumed> ) = 0x7f13a25a6000
[pid 2120] [ 7f13a09c9e40] close(4 <unfinished ...>
[pid 2200] [ 7f13a09c9f00] write(9, "-17\n", 4 <unfinished ...>


(working server - with bnx2 module unloaded)
[pid 1323] [ 7fae577fbe40] close(7) = 0
[pid 1631] [ 7fae577fbcb0] open("/proc/self/oom_adj", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
[pid 1323] [ 7fae577fbf00] write(8, "\0\0\2\217\0", 5 <unfinished ...>
[pid 1631] [ 7fae577fbcb0] <... open resumed> ) = 10
[pid 1323] [ 7fae577fbf00] <... write resumed> ) = 5
[pid 1323] [ 7fae577fbf00] write(8, "\0\0\2\206\n\n\n\nPort 22\n\n\n\nProtocol 2\n\nH"..., 654 <unfinished ...>
[pid 1631] [ 7fae577fb844] fstat(10, <unfinished ...>
[pid 1323] [ 7fae577fbf00] <... write resumed> ) = 654
[pid 1631] [ 7fae577fb844] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 1323] [ 7fae577fbe40] close(8) = 0
[pid 1631] [ 7fae57804a2a] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 1323] [ 7fae577fbe40] close(9 <unfinished ...>
[pid 1631] [ 7fae57804a2a] <... mmap resumed> ) = 0x7fae593d9000
[pid 1323] [ 7fae577fbe40] <... close resumed> ) = 0
[pid 1323] [ 7fae577fbe40] close(5 <unfinished ...>
[pid 1631] [ 7fae577fbf00] write(10, "0\n", 2 <unfinished ...>

The two servers are *EXACT* duplicates of each other, completely fresh Debian installs, with exactly the same packages installed.

As you can see, the working server sends "0" into the oom_adj and the broken one sends "-17".


On 30/05/2011 15:27, Cal Leeming [Simplicity Media Ltd] wrote:
I FOUND THE PROBLEM!!!

Explicit details can be found on the Debian kernel mailing list, but to cut short, it's caused by the firmware-bnx2 kernel module:

The broken server uses 'firmware-bnx2'.. so I purged the bnx2 package, removed the bnx*.ko files from /lib/modules, ran update-initramfs, and then rebooted (i then confirmed it was removed by checking ifconfig and lsmod).

And guess what.. IT WORKED.

So, this problem seems to be caused by the firmware-bnx2 module being loaded.. some how, that module is causing -17 oom_adj to be set for everything..

WTF?!?! Surely a bug?? Could someone please forward this to the appropriate person for the bnx2 kernel module, as I wouldn't even know where to begin :S

Cal

On 30/05/2011 11:52, Cal Leeming [Simplicity Media Ltd] wrote:
-resent due to incorrect formatting, sorry if this dups!

@Kame
Thanks for the reply!
Both kernels used the same env/dist, but which slightly different packages.
After many frustrating hours, I have pin pointed this down to a dodgy
Debian package which appears to continue affecting the system, even
after purging. I'm still yet to pin point the package down (I'm doing
several reinstall tests, along with tripwire analysis after each
reboot).

@Hiroyuki
Thank you for sending this to the right people!

@linux-mm
On a side note, would someone mind taking a few minutes to give a
brief explanation as to how the default oom_adj is set, and under what
conditions it is given -17 by default? Is this defined by the
application? I looked through the kernel source,
and noticed some of the code was defaulted to set oom_adj to
OOM_DISABLE (which is defined in the headers as -17).

Assuming the debian problem is resolved, this might be another call
for the oom-killer to be modified so that if it encounters the
unrecoverable loop, it ignores the -17 rule (with some exceptions,
such as kernel processes, and other critical things). If this is going
to be a relatively simple task, I wouldn't mind spending a few hours
patching this?

Cal

On Mon, May 30, 2011 at 3:23 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
Thank you. memory cgroup and OOM troubles are handled in linux-mm.

On Sun, 29 May 2011 23:24:07 +0100
"Cal Leeming [Simplicity Media Ltd]"<cal.leeming@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Some further logs:
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.369927] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.369939]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.399285] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.399296]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.428690] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.428702]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.487696] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.487708]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.517023] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.517035]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.546379] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.546391]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.310789] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.310804]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.369918] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.369930]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.399284] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.399296]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.433634] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.433648]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.463947] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.463959]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.493439] redis-server
invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.493451]
[<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283


hmm, in short, applications has -17 oom_adj in default with 2.6.32.41 ?
AFAIK, no kernel has such crazy settings as default..

Does your 2 kernel uses the same environment/distribution ?

Thanks,
-Kame

On 29/05/2011 22:50, Cal Leeming [Simplicity Media Ltd] wrote:
First of all, my apologies if I have submitted this problem to the
wrong place, spent 20 minutes trying to figure out where it needs to
be sent, and was still none the wiser.

The problem is related to applying memory limitations within a cgroup.
If the OOM killer kicks in, it gets stuck in a loop where it tries to
kill a process which has an oom_adj of -17. This causes an infinite
loop, which in turn locks up the system.

May 30 03:13:08 vicky kernel: [ 1578.117055] Memory cgroup out of
memory: kill process 6016 (java) score 0 or a child
May 30 03:13:08 vicky kernel: [ 1578.117154] Memory cgroup out of
memory: kill process 6016 (java) score 0 or a child
May 30 03:13:08 vicky kernel: [ 1578.117248] Memory cgroup out of
memory: kill process 6016 (java) score 0 or a child
May 30 03:13:08 vicky kernel: [ 1578.117343] Memory cgroup out of
memory: kill process 6016 (java) score 0 or a child
May 30 03:13:08 vicky kernel: [ 1578.117441] Memory cgroup out of
memory: kill process 6016 (java) score 0 or a child


root@vicky [/home/foxx]> uname -a
Linux vicky 2.6.32.41-grsec #3 SMP Mon May 30 02:34:43 BST 2011 x86_64
GNU/Linux
(this happens on both the grsec patched and non patched 2.6.32.41 kernel)

When this is encountered, the memory usage across the whole server is
still within limits (not even hitting swap).

The memory configuration for the cgroup/lxc is:
lxc.cgroup.memory.limit_in_bytes = 3000M
lxc.cgroup.memory.memsw.limit_in_bytes = 3128M

Now, what is even more strange, is that when running under the
2.6.32.28 kernel (both patched and unpatched), this problem doesn't
happen. However, there is a slight difference between the two kernels.
The 2.6.32.28 kernel gives a default of 0 in the /proc/X/oom_adj,
where as the 2.6.32.41 gives a default of -17. I suspect this is the
root cause of why it's showing in the later kernel, but not the earlier.

To test this theory, I started up the lxc on both servers, and then
ran a one liner which showed me all the processes with an oom_adj of -17:

(the below is the older/working kernel)
root@xxxxxxxxxxxxxxxxx [/mnt/encstore/lxc]> uname -a
Linux courtney.internal 2.6.32.28-grsec #3 SMP Fri Feb 18 16:09:07 GMT
2011 x86_64 GNU/Linux
root@xxxxxxxxxxxxxxxxx [/mnt/encstore/lxc]> for x in `find /proc
-iname 'oom_adj' | xargs grep "\-17" | awk -F '/' '{print $3}'` ; do
ps -p $x --no-headers ; done
grep: /proc/1411/task/1411/oom_adj: No such file or directory
grep: /proc/1411/oom_adj: No such file or directory
804 ? 00:00:00 udevd
804 ? 00:00:00 udevd
25536 ? 00:00:00 sshd
25536 ? 00:00:00 sshd
31861 ? 00:00:00 sshd
31861 ? 00:00:00 sshd
32173 ? 00:00:00 udevd
32173 ? 00:00:00 udevd
32174 ? 00:00:00 udevd
32174 ? 00:00:00 udevd

(the below is the newer/broken kernel)
root@vicky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> uname -a
Linux vicky 2.6.32.41-grsec #3 SMP Mon May 30 02:34:43 BST 2011 x86_64
GNU/Linux
root@vicky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> for x in
`find /proc -iname 'oom_adj' | xargs grep "\-17" | awk -F '/' '{print
$3}'` ; do ps -p $x --no-headers ; done
grep: /proc/3118/task/3118/oom_adj: No such file or directory
grep: /proc/3118/oom_adj: No such file or directory
895 ? 00:00:00 udevd
895 ? 00:00:00 udevd
1091 ? 00:00:00 udevd
1091 ? 00:00:00 udevd
1092 ? 00:00:00 udevd
1092 ? 00:00:00 udevd
2596 ? 00:00:00 sshd
2596 ? 00:00:00 sshd
2608 ? 00:00:00 sshd
2608 ? 00:00:00 sshd
2613 ? 00:00:00 sshd
2613 ? 00:00:00 sshd
2614 pts/0 00:00:00 bash
2614 pts/0 00:00:00 bash
2620 pts/0 00:00:00 sudo
2620 pts/0 00:00:00 sudo
2621 pts/0 00:00:00 su
2621 pts/0 00:00:00 su
2622 pts/0 00:00:00 bash
2622 pts/0 00:00:00 bash
2685 ? 00:00:00 lxc-start
2685 ? 00:00:00 lxc-start
2699 ? 00:00:00 init
2699 ? 00:00:00 init
2939 ? 00:00:00 rc
2939 ? 00:00:00 rc
2942 ? 00:00:00 startpar
2942 ? 00:00:00 startpar
2964 ? 00:00:00 rsyslogd
2964 ? 00:00:00 rsyslogd
2964 ? 00:00:00 rsyslogd
2964 ? 00:00:00 rsyslogd
2980 ? 00:00:00 startpar
2980 ? 00:00:00 startpar
2981 ? 00:00:00 ctlscript.sh
2981 ? 00:00:00 ctlscript.sh
3016 ? 00:00:00 cron
3016 ? 00:00:00 cron
3025 ? 00:00:00 mysqld_safe
3025 ? 00:00:00 mysqld_safe
3032 ? 00:00:00 sshd
3032 ? 00:00:00 sshd
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3097 ? 00:00:00 mysqld.bin
3113 ? 00:00:00 ctl.sh
3113 ? 00:00:00 ctl.sh
3115 ? 00:00:00 sleep
3115 ? 00:00:00 sleep
3116 ? 00:00:00 .memcached.bin
3116 ? 00:00:00 .memcached.bin


As you can see, it is clear that the newer kernel is setting -17 by
default, which in turn is causing the OOM killer loop.

So I began to try and find what may have caused this problem by
comparing the two sources...

I checked the code for all references to 'oom_adj' and 'oom_adjust' in
both code sets, but found no obvious differences:
grep -R -e oom_adjust -e oom_adj . | sort | grep -R -e oom_adjust -e
oom_adj

Then I checked for references to "-17" in all .c and .h files, and
found a couple of matches, but only one obvious one:
grep -R "\-17" . | grep -e ".c:" -e ".h:" -e "\-17" | wc -l
./include/linux/oom.h:#define OOM_DISABLE (-17)

But again, a search for OOM_DISABLE came up with nothing obvious...

In a last ditch attempt, I did a search for all references to 'oom'
(case-insensitive) in both code bases, then compared the two:
root@annabelle [~/lol/linux-2.6.32.28]> grep -i -R "oom" . | sort -n
/tmp/annabelle.oom_adj
root@vicky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> grep -i -R
"oom" . | sort -n> /tmp/vicky.oom_adj

and this brought back (yet again) nothing obvious..


root@vicky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> md5sum
./include/linux/oom.h
2a32622f6cd38299fc2801d10a9a3ea8 ./include/linux/oom.h

root@annabelle [~/lol/linux-2.6.32.28]> md5sum ./include/linux/oom.h
2a32622f6cd38299fc2801d10a9a3ea8 ./include/linux/oom.h

root@vicky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> md5sum
./mm/oom_kill.c
1ef2c2bec19868d13ec66ec22033f10a ./mm/oom_kill.c

root@annabelle [~/lol/linux-2.6.32.28]> md5sum ./mm/oom_kill.c
1ef2c2bec19868d13ec66ec22033f10a ./mm/oom_kill.c



Could anyone please shed some light as to why the default oom_adj is
set to -17 now (and where it is actually set)? From what I can tell,
the fix for this issue will either be:

1. Allow OOM killer to override the decision of ignoring oom_adj ==
-17 if an unrecoverable loop is encountered.
2. Change the default back to 0.

Again, my apologies if this bug report is slightly unorthodox, or
doesn't follow usual procedure etc. I can assure you I have tried my
absolute best to give all the necessary information though.

Cal

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/