Re: 3.0.60: general protection fault: 0000, Fixing recursive fault but reboot is needed

From: Mike Galbraith
Date: Mon Apr 15 2013 - 04:30:15 EST


On Mon, 2013-04-15 at 07:33 +0200, Nikola Ciprich wrote:
> Hi,
>
> one of our servers keeps spitting GPF messages:
> (sorry for long message)
>
> [34110.179005] general protection fault: 0000 [#1] PREEMPT SMP
> [34110.185000] CPU 0
> [34110.186872] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler ip6table_filter ip6_tables ipt_MASQUERADE ipt_REJECT xt_CHECKSUM vhost_net macvtap macvlan tun virtio_net virtio virtio_ring kvm_intel kvm sch_htb xt_IMQ imq xt_physdev xt_comment ipt_REDIRECT xt_tcpudp xt_mark xt_multiport xt_conntrack nf_nat_ftp nf_conntrack_ftp iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables capi ipt_ULOG x_tables nfs lockd auth_rpcgss nfs_acl autofs4 sunrpc bridge stp llc ipv6 ext3 jbd kernelcapi avmfritz mISDNipac mISDN_core joydev processor thermal_sys pcspkr ghes hed i7core_edac edac_core i2c_i801 i2c_core iTCO_wdt e1000e sg usbhid ext4 jbd2 crc16 sd_mod crc_t10dif ehci_hcd arcmsr scsi_mod button dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_msghandler]
> [34110.265159]
> [34110.266744] Pid: 5628, comm: kavupdater Not tainted 3.0.60lb6.01 #1 Supermicro X8SIA/X8SIA
> [34110.276854] RIP: 0010:[<ffffffff8115c730>] [<ffffffff8115c730>] dup_fd+0x170/0x320
> [34110.284698] RSP: 0018:ffff880230e2bd90 EFLAGS: 00010206
> [34110.290251] RAX: 00000000000007f8 RBX: ffff880040fd9600 RCX: bfffffffffffffff
> [34110.297470] RDX: 0000880233743f00 RSI: 00000000000000ff RDI: 0000000000000800
> [34110.304687] RBP: ffff880230e2bde0 R08: ffff88003c25fe40 R09: 0000000000000003
> [34110.311990] R10: 0000000000000001 R11: 4000000000000000 R12: ffff88003c0f2000
> [34110.319286] R13: ffff88022e92b800 R14: ffff88003c25fa40 R15: 0000000000000100
> [34110.326521] FS: 00007f2badf40700(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
> [34110.334819] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [34110.340651] CR2: 0000000001c5f710 CR3: 00000002300ef000 CR4: 00000000000026e0
> [34110.348015] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
> [34110.355300] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [34110.362560] Process kavupdater (pid: 5628, threadinfo ffff880230e2a000, task ffff880231c2c5f0)
> [34110.371412] Stack:
> [34110.373507] 0000000000000020 ffff880233753940 ffff880040fd9610 ffff88022eb6a180
> [34110.381260] 00007f2badf409d0 0000000001200011 ffff8800487245f0 0000000000000000
> [34110.389065] 00007f2badf409d0 0000000000000000 ffff880230e2be80 ffffffff8104f77b
> [34110.396941] Call Trace:
> [34110.399478] [<ffffffff8104f77b>] copy_process+0xd1b/0x13b0
> [34110.405234] [<ffffffff8102f410>] ? do_page_fault+0x1d0/0x480
> [34110.411062] [<ffffffff8104fe65>] do_fork+0x55/0x380
> [34110.416126] [<ffffffff813c014e>] ? _raw_spin_unlock_irq+0xe/0x40
> [34110.422304] [<ffffffff813c014e>] ? _raw_spin_unlock_irq+0xe/0x40
> [34110.428621] [<ffffffff81064f83>] ? set_current_blocked+0x53/0x60
> [34110.434801] [<ffffffff8100b358>] sys_clone+0x28/0x30
> [34110.440000] [<ffffffff813c10a3>] stub_clone+0x13/0x20
> [34110.445253] [<ffffffff813c0d82>] ? system_call_fastpath+0x16/0x1b
> [34110.451584] Code: 7e 10 48 8b 71 10 4c 89 c2 e8 ed ba 0a 00 45 85 ff 74 71 41 8d 47 ff 31 f6 41 ba 01 00 00 00 48 8d 3c c5 08 00 00 00 31 c0 eb 15 <f0> 48 ff 42 48 49 89 14 04 48 83 c0 08 83 c6 01 48 39 f8 74 3b
> [34110.475183] RIP [<ffffffff8115c730>] dup_fd+0x170/0x320
> [34110.480626] RSP <ffff880230e2bd90>
> [34110.484409] ---[ end trace 771117da60ee2556 ]---

Feeding that to scripts/decodecode:
Code: 7e 10 48 8b 71 10 4c 89 c2 e8 ed ba 0a 00 45 85 ff 74 71 41 8d 47 ff 31 f6 41 ba 01 00 00 00 48 8d 3c c5 08 00 00 00 31 c0 eb 15 <f0> 48 ff 42 48 49 89 14 04 48 83 c0 08 83 c6 01 48 39 f8 74 3b
All code
========
   0:   7e 10                   jle    0x12
   2:   48 8b 71 10             mov    0x10(%rcx),%rsi
   6:   4c 89 c2                mov    %r8,%rdx
   9:   e8 ed ba 0a 00          callq  0xabafb
   e:   45 85 ff                test   %r15d,%r15d
  11:   74 71                   je     0x84
  13:   41 8d 47 ff             lea    -0x1(%r15),%eax
  17:   31 f6                   xor    %esi,%esi
  19:   41 ba 01 00 00 00       mov    $0x1,%r10d
  1f:   48 8d 3c c5 08 00 00    lea    0x8(,%rax,8),%rdi
  26:   00
  27:   31 c0                   xor    %eax,%eax
  29:   eb 15                   jmp    0x40
  2b:*  f0 48 ff 42 48          lock incq 0x48(%rdx)     <-- trapping instruction
  30:   49 89 14 04             mov    %rdx,(%r12,%rax,1)
  34:   48 83 c0 08             add    $0x8,%rax
  38:   83 c6 01                add    $0x1,%esi
  3b:   48 39 f8                cmp    %rdi,%rax
  3e:   74 3b                   je     0x7b

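RDX: 0000880233743f00.. that's not a canonical address (the sane pointers
in that dump all start with ffff88), so that certainly will go boom, and
with a GPF rather than a page fault.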

That's here in dup_fd():
	for (i = open_files; i != 0; i--) {
		struct file *f = *old_fds++;
		if (f) {
			get_file(f);

It's doing that get_file(), grabbing a reference to each open file in a
loop, but old_fds points off into lala land, so I'd say you must have
memory corruption, and open_files is garbage. Seeing "one of our
servers..", operative word being "one", I'd tend to suspect hardware
(heat or such), given that the box exploded in this extremely heavily
exercised spot: every fork goes through here, so a bug in this code
would be biting everybody, not just one machine.

-Mike
