Need help debugging crazy kernel memory issue

From: Orion Poplawski
Date: Wed Aug 01 2012 - 18:35:39 EST


I recently started experiencing crashes (every 1-2 days) on one of my ScientificLinux 6.2 boxes. It appears that the machine runs out of memory, but the memory report makes no sense.

I'm seeing it with each of the following kernels:

kernel-2.6.32-220.17.1.el6.x86_64
kernel-2.6.32-220.23.1.el6.x86_64
kernel-2.6.32-279.1.1.el6.x86_64

kernel-2.6.32-220.17.1.el6.x86_64 had run fine for a long time, but reverting to it has not solved the problem. I have tried to revert all other updates from around the time the problems started to no avail.

It seems like everything about the memory config goes screwy. Here's an info dump from normal operation (48GB RAM, 8GB swap):

SysRq : Show Memory
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
CPU 4: hi: 0, btch: 1 usd: 0
CPU 5: hi: 0, btch: 1 usd: 0
CPU 6: hi: 0, btch: 1 usd: 0
CPU 7: hi: 0, btch: 1 usd: 0
CPU 8: hi: 0, btch: 1 usd: 0
CPU 9: hi: 0, btch: 1 usd: 0
CPU 10: hi: 0, btch: 1 usd: 0
CPU 11: hi: 0, btch: 1 usd: 0
CPU 12: hi: 0, btch: 1 usd: 0
CPU 13: hi: 0, btch: 1 usd: 0
CPU 14: hi: 0, btch: 1 usd: 0
CPU 15: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 180
CPU 1: hi: 186, btch: 31 usd: 158
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CPU 8: hi: 186, btch: 31 usd: 0
CPU 9: hi: 186, btch: 31 usd: 19
CPU 10: hi: 186, btch: 31 usd: 30
CPU 11: hi: 186, btch: 31 usd: 0
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 0
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 144
CPU 1: hi: 186, btch: 31 usd: 124
CPU 2: hi: 186, btch: 31 usd: 182
CPU 3: hi: 186, btch: 31 usd: 162
CPU 4: hi: 186, btch: 31 usd: 182
CPU 5: hi: 186, btch: 31 usd: 162
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CPU 8: hi: 186, btch: 31 usd: 138
CPU 9: hi: 186, btch: 31 usd: 119
CPU 10: hi: 186, btch: 31 usd: 88
CPU 11: hi: 186, btch: 31 usd: 75
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 183
CPU 14: hi: 186, btch: 31 usd: 179
CPU 15: hi: 186, btch: 31 usd: 159
Node 1 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 184
CPU 1: hi: 186, btch: 31 usd: 168
CPU 2: hi: 186, btch: 31 usd: 164
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 41
CPU 5: hi: 186, btch: 31 usd: 172
CPU 6: hi: 186, btch: 31 usd: 126
CPU 7: hi: 186, btch: 31 usd: 145
CPU 8: hi: 186, btch: 31 usd: 63
CPU 9: hi: 186, btch: 31 usd: 158
CPU 10: hi: 186, btch: 31 usd: 162
CPU 11: hi: 186, btch: 31 usd: 165
CPU 12: hi: 186, btch: 31 usd: 48
CPU 13: hi: 186, btch: 31 usd: 64
CPU 14: hi: 186, btch: 31 usd: 66
CPU 15: hi: 186, btch: 31 usd: 165
active_anon:1356095 inactive_anon:839 isolated_anon:0
active_file:93487 inactive_file:384620 isolated_file:0
unevictable:0 dirty:65 writeback:0 unstable:0
free:10288450 slab_reclaimable:40068 slab_unreclaimable:34613
mapped:9236 shmem:474 pagetables:6530 bounce:0
Node 0 DMA free:15396kB min:36kB low:44kB high:52kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14984kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2991 24201 24201
Node 0 DMA32 free:2563944kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3063392kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 21210 21210
Node 0 Normal free:16895636kB min:57372kB low:71712kB high:86056kB active_anon:2986668kB inactive_anon:92kB active_file:278460kB inactive_file:1316128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:21719040kB mlocked:0kB dirty:120kB writeback:0kB mapped:18948kB shmem:360kB slab_reclaimable:123620kB slab_unreclaimable:95556kB kernel_stack:7024kB pagetables:12872kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:21678824kB min:65568kB low:81960kB high:98352kB active_anon:2437712kB inactive_anon:3264kB active_file:95488kB inactive_file:222352kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:24821760kB mlocked:0kB dirty:140kB writeback:0kB mapped:17996kB shmem:1536kB slab_reclaimable:36652kB slab_unreclaimable:42896kB kernel_stack:696kB pagetables:13248kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 2*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15396kB
Node 0 DMA32: 6*4kB 12*8kB 3*16kB 6*32kB 8*64kB 16*128kB 6*256kB 7*512kB 8*1024kB 6*2048kB 619*4096kB = 2563944kB
Node 0 Normal: 1832*4kB 1354*8kB 846*16kB 583*32kB 272*64kB 65*128kB 70*256kB 53*512kB 37*1024kB 2*2048kB 4085*4096kB = 16895280kB
Node 1 Normal: 509*4kB 2146*8kB 1166*16kB 831*32kB 346*64kB 197*128kB 200*256kB 202*512kB 192*1024kB 131*2048kB 5114*4096kB = 21678276kB
478555 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 8388600kB
Total swap = 8388600kB
12582896 pages RAM
227482 pages reserved
450152 pages shared
1712399 pages non-shared

and here's the initial oom and Mem-Info message:

lvm invoked oom-killer: gfp_mask=0x201d0, order=0, oom_adj=0, oom_score_adj=0
lvm cpuset=/ mems_allowed=0
Pid: 3405, comm: lvm Not tainted 2.6.32-279.1.1.el6.x86_64 #1
Call Trace:
[<ffffffff810c4981>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff811170f0>] ? dump_header+0x90/0x1b0
[<ffffffff8121470c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81117572>] ? oom_kill_process+0x82/0x2a0
[<ffffffff811174b1>] ? select_bad_process+0xe1/0x120
[<ffffffff811179b0>] ? out_of_memory+0x220/0x3c0
[<ffffffff811b3380>] ? blkdev_get_block+0x0/0x70
[<ffffffff811276ce>] ? __alloc_pages_nodemask+0x89e/0x940
[<ffffffff8115c1ea>] ? alloc_pages_current+0xaa/0x110
[<ffffffff811144f7>] ? __page_cache_alloc+0x87/0x90
[<ffffffff81113ede>] ? find_get_page+0x1e/0xa0
[<ffffffff8111606b>] ? do_read_cache_page+0x4b/0x180
[<ffffffff811b4330>] ? blkdev_readpage+0x0/0x20
[<ffffffff811161e9>] ? read_cache_page_async+0x19/0x20
[<ffffffff811161fe>] ? read_cache_page+0xe/0x20
[<ffffffff811ecaa0>] ? read_dev_sector+0x30/0xa0
[<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
[<ffffffff811161e9>] ? read_cache_page_async+0x19/0x20
[<ffffffff811ecaa0>] ? read_dev_sector+0x30/0xa0
[<ffffffff811ef1ac>] ? osf_partition+0x6c/0x120
[<ffffffff811ed7d7>] ? rescan_partitions+0x1a7/0x470
[<ffffffff811b4ab6>] ? __blkdev_get+0x1b6/0x3c0
[<ffffffff811b4ce0>] ? blkdev_open+0x0/0xc0
[<ffffffff811b4cd0>] ? blkdev_get+0x10/0x20
[<ffffffff811b4d51>] ? blkdev_open+0x71/0xc0
[<ffffffff8117889a>] ? __dentry_open+0x10a/0x360
[<ffffffff8121c272>] ? selinux_inode_permission+0x72/0xb0
[<ffffffff812142af>] ? security_inode_permission+0x1f/0x30
[<ffffffff81178c04>] ? nameidata_to_filp+0x54/0x70
[<ffffffff8118c110>] ? do_filp_open+0x6c0/0xd60
[<ffffffff81198192>] ? alloc_fd+0x92/0x160
[<ffffffff81178649>] ? do_sys_open+0x69/0x140
[<ffffffff81178760>] ? sys_open+0x20/0x30
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 42, btch: 7 usd: 23
active_anon:49 inactive_anon:97 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:3846 dirty:0 writeback:0 unstable:0
free:412 slab_reclaimable:1194 slab_unreclaimable:5681
mapped:356 shmem:0 pagetables:31 bounce:0
Node 0 DMA free:224kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:328kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 125 125 125
Node 0 DMA32 free:1424kB min:1428kB low:1784kB high:2140kB active_anon:196kB inactive_anon:388kB active_file:0kB inactive_file:0kB unevictable:15384kB isolated(anon):0kB isolated(file):0kB present:128256kB mlocked:0kB dirty:0kB writeback:0kB mapped:1424kB shmem:0kB slab_reclaimable:4776kB slab_unreclaimable:22724kB kernel_stack:600kB pagetables:124kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 224kB
Node 0 DMA32: 0*4kB 2*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1424kB45035
3846 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
45035 pages RAM
16585 pages reserved
359 pages shared
23771 pages non-shared


Only 1 cpu listed, memory numbers are incredibly small (180MB of RAM?! - no wonder it is out of memory), no swap, no Normal nodes listed, etc.

It's fairly consistent, I see the same # of pages RAM each time it crashes. I ran memcheck through test #4 with no errors.

cmdline:

ro root=/dev/mapper/vg_root-root rd_MD_UUID=486b6486:65829f41:e3ccc1e2:ace1579a rd_LVM_LV=vg_root/root rd_LVM_LV=vg_root/swap rd_NO_LUKS rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=512M-2G:64M,2G-:128M console=tty0 console=ttyS0,115200


Any ideas? I'm at a loss.

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane orion@xxxxxxxx
Boulder, CO 80301 http://www.nwra.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/