Slow, persistent memory leak in 2.6.20

From: Fred Tyler
Date: Sun Aug 26 2007 - 10:39:21 EST


I think I've come across a memory leak in 2.6.20. I've upgraded to the
latest 2.6.20.17, but it didn't seem to help.

A little background: I saw something exactly like this many months ago
with a 2.6.12 kernel. However, by 2.6.16.x the leak had apparently
gone away, so I assumed it had been fixed and didn't pursue it. But
either it remains in 2.6.20 or a new leak has appeared.

FWIW, this is an x86_64 machine, but I also saw nearly the same
behavior on an i386 machine running 2.6.12. (Links to graphs showing
long-term memory usage are at the bottom of this email if you want to
skip all the text stats in the middle.)

Immediately after booting the system, I shut down all services to get
a baseline for comparison. Here is the output of top and vmstat with
virtually nothing running:

=========== top =============

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
754 root 15 0 16948 2368 1732 R 0.0 0.3 0:00.01 sshd
757 root 15 0 5620 1440 1124 S 0.0 0.2 0:00.00 bash
1195 root 15 0 6300 1116 880 R 0.3 0.1 0:00.02 top
1196 root 18 0 3880 628 516 S 0.0 0.1 0:00.00 agetty
1 root 18 0 3888 516 412 S 0.0 0.1 0:00.26 init
741 root 18 0 3880 508 412 S 0.0 0.1 0:00.00 agetty
742 root 18 0 3876 504 412 S 0.0 0.1 0:00.00 agetty
743 root 18 0 3876 504 412 S 0.0 0.1 0:00.00 agetty
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/0
5 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
57 root 10 -5 0 0 0 S 0.0 0.0 0:00.02 kblockd/0
58 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 ata/0
59 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 ata_aux
62 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
64 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
121 root 25 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
122 root 15 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
123 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kswapd0
124 root 19 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
220 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
221 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1
245 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
246 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 usb-storage
256 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 reiserfs/0

=========== free ============

             total       used       free     shared    buffers     cached
Mem:        899408      96824     802584          0      12604      70064
-/+ buffers/cache:      14156     885252
Swap:        65528          0      65528


=========== vmstat ==============

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 802152  13008  70096    0    0   402   184  282   87  2  1 88  9


=========== vmstat -s ===============

899408 total memory
97248 used memory
50352 active memory
34368 inactive memory
802160 free memory
13080 buffer memory
70104 swap cache
65528 total swap
0 used swap
65528 free swap
349 non-nice user cpu ticks
0 nice user cpu ticks
172 system cpu ticks
15743 idle cpu ticks
1522 IO-wait cpu ticks
0 IRQ cpu ticks
10 softirq cpu ticks
0 stolen cpu ticks
69682 pages paged in
32228 pages paged out
0 pages swapped in
0 pages swapped out
50228 interrupts
15207 CPU context switches
1188132534 boot time
1213 forks

==================================



OK, now I start all services back up and let the system run for about
12 hours. At the end of this time, I shut down all services again so
that virtually nothing is running. Here are the stats:



=========== top ================

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17250 root 15 0 16952 2372 1732 R 0.0 0.3 0:00.09 sshd
17253 root 15 0 5624 1448 1124 S 0.0 0.2 0:00.01 bash
23409 root 15 0 6304 1124 884 R 0.0 0.1 0:00.00 top
23410 root 18 0 3880 628 516 S 0.0 0.1 0:00.00 agetty
1 root 18 0 3884 516 412 S 0.0 0.1 0:00.56 init
750 root 18 0 3880 508 412 S 0.0 0.1 0:00.00 agetty
751 root 18 0 3880 508 412 S 0.0 0.1 0:00.00 agetty
749 root 18 0 3876 504 412 S 0.0 0.1 0:00.00 agetty
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/0
4 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/0
5 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
57 root 10 -5 0 0 0 S 0.0 0.0 0:00.31 kblockd/0
58 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 ata/0
59 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 ata_aux
62 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
64 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
121 root 15 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
123 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kswapd0
124 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
220 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
221 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1
245 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
246 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 usb-storage
256 root 10 -5 0 0 0 S 0.0 0.0 0:00.02 reiserfs/0
17277 root 15 0 0 0 0 S 0.0 0.0 0:00.16 pdflush

============= free ===========

             total       used       free     shared    buffers     cached
Mem:        899408     747128     152280          0     166228     444540
-/+ buffers/cache:     136360     763048
Swap:        65528          0      65528


============= vmstat ==============

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 152288 166228 444564    0    0    10    30  255   29  0  0 99  0

============ vmstat -s ==============

899408 total memory
747248 used memory
338700 active memory
273736 inactive memory
152160 free memory
166228 buffer memory
444660 swap cache
65528 total swap
0 used swap
65528 free swap
7522 non-nice user cpu ticks
300 nice user cpu ticks
4120 system cpu ticks
3699397 idle cpu ticks
13963 IO-wait cpu ticks
49 IRQ cpu ticks
146 softirq cpu ticks
0 stolen cpu ticks
355378 pages paged in
1108508 pages paged out
0 pages swapped in
0 pages swapped out
9505965 interrupts
1095062 CPU context switches
1188095217 boot time
23440 forks

======================================


After 12 hours, with all of the services shut down again, there is
still a lot of memory in use: free reports 136360 KB used after
subtracting buffers/cache, compared with 14156 KB right after boot.
But where is it going?
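
As a rough cross-check of the two free outputs above, here is a minimal
Python sketch that redoes the "-/+ buffers/cache" arithmetic and
computes the growth between the two samples. The numbers are copied
from the outputs above; the variable and function names are only
illustrative.

```python
# Values in KB, copied from the two `free` outputs above.
baseline = {"used": 96824, "buffers": 12604, "cached": 70064}
after_12h = {"used": 747128, "buffers": 166228, "cached": 444540}


def used_excluding_cache(sample):
    """Memory in use once buffers and page cache are subtracted."""
    return sample["used"] - sample["buffers"] - sample["cached"]


base_kb = used_excluding_cache(baseline)
later_kb = used_excluding_cache(after_12h)
growth_kb = later_kb - base_kb

print(base_kb, later_kb, growth_kb)  # prints: 14156 136360 122204
```

Both results match the "-/+ buffers/cache" lines that free printed, so
the growth over the run is about 122204 KB that is neither buffers nor
cache.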

I have compared this to a machine running i386 2.6.16.2x, and when I
stop all services there, leaving nothing but ssh running, only a tiny
amount of RAM is in use, as expected.

I can verify that this memory loss never stops: The lost memory keeps
increasing until the machine goes into swap, and it will eventually
crash if left to its own devices. However, on machines with big RAM,
this process can take a month or more.
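
For anyone who wants to watch the same number on their own machine
over time, here is a minimal Python sketch that reads /proc/meminfo and
reports memory in use excluding buffers and cache. It is not how the
graphs below were produced; the function names are hypothetical, and it
assumes the MemTotal, MemFree, Buffers and Cached fields are present.

```python
def parse_meminfo(text):
    """Return {field name: value in kB} from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <number> ..." lines; skip anything else.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_excluding_cache_kb(fields):
    """Same quantity as the 'used' column of free's -/+ buffers/cache line."""
    return (fields["MemTotal"] - fields["MemFree"]
            - fields["Buffers"] - fields["Cached"])


def sample(path="/proc/meminfo"):
    """Read the given meminfo file and return used-excluding-cache in kB."""
    with open(path) as f:
        return used_excluding_cache_kb(parse_meminfo(f.read()))
```

Calling sample() at a fixed interval (from cron, say) and logging the
result with a timestamp gives a series that should stay roughly flat on
a healthy idle machine and climb steadily on an affected one.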

Here are links to three cacti graphs where you can see the effect over
the long term:

This graph is from a machine running 2.6.16.27/i386, which does not
have any memory loss. You can see the long-term memory line is flat:

http://i239.photobucket.com/albums/ff117/fredty8/memory-a4.png

Now here is a graph from a machine running 2.6.12/i386, which clearly
shows a long-term memory loss. The points where the memory shoots back
up to its full level are when the machine had to be rebooted because
it was going into swap:

http://i239.photobucket.com/albums/ff117/fredty8/memory-a2.png

And finally, here is a graph from a machine running 2.6.20.15/x86_64,
which shows memory loss very similar to that of the 2.6.12 machine.
(This machine has only been up for a few weeks, which is why the graph
is so short. But it is clear that the graph is doing the same thing as
2.6.12):

http://i239.photobucket.com/albums/ff117/fredty8/memory-b1.png


If you need any more information from me, I'll be happy to provide it.
Please CC me on replies.