Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

From: Andy Lutomirski
Date: Wed Mar 18 2015 - 15:27:00 EST


Hi Linus-

You seem to enjoy debugging these things. Want to give this a shot?
My guess is a vmalloc fault accessing either old_rsp or kernel_stack
right after swapgs in syscall entry.
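To make the guess concrete, here is a hand-waved C model of the sequence I
have in mind. It is not kernel code: the real thing is a couple of assembly
instructions in arch/x86/kernel/entry_64.S (roughly "swapgs; movq %rsp,
PER_CPU_VAR(old_rsp); movq PER_CPU_VAR(kernel_stack), %rsp"), and the struct
and function names below are made up for illustration. The point is that both
per-CPU accesses are %gs-relative memory accesses made while RSP still holds
the user stack pointer, so a vmalloc fault on either one enters page_fault on
the user stack.

#include <stdint.h>

/* Illustrative per-CPU layout only; the real kernel's offsets differ. */
struct percpu_area {
	uint64_t old_rsp;	/* scratch slot for the user RSP */
	uint64_t kernel_stack;	/* top of this CPU's kernel stack */
};

static uint64_t syscall_entry_model(struct percpu_area *gsbase,
				    uint64_t user_rsp)
{
	gsbase->old_rsp = user_rsp;	/* a fault here runs on the user stack */
	return gsbase->kernel_stack;	/* ...and so does a fault here */
}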

On Wed, Mar 18, 2015 at 12:03 PM, Stefan Seyfried
<stefan.seyfried@xxxxxxxxxxxxxx> wrote:
> Hi all,
>
> first, I'm kind of happy that I'm not the only one seeing this, and
> thus my beloved Thinkpad can stay for a bit longer... :-)
>
> Then, I'm mostly an amateur when it comes to kernel debugging, so bear
> with me when I'm stumbling through the code...
>
> On 18.03.2015 at 19:03, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai <tiwai@xxxxxxx> wrote:
>>> At Wed, 18 Mar 2015 18:43:52 +0100,
>>> Takashi Iwai wrote:
>>>>
>>>> At Wed, 18 Mar 2015 15:16:42 +0100,
>>>> Takashi Iwai wrote:
>>>>>
>>>>> At Sun, 15 Mar 2015 09:17:15 +0100,
>>>>> Stefan Seyfried wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> in 4.0-rc I have recently seen a few crashes, always when running
>>>>>> KVM guests (IIRC). Today I was able to capture a crash dump, this
>>>>>> is the backtrace from dmesg.txt:
>>>>>>
>>>>>> [242060.604870] PANIC: double fault, error_code: 0x0
>>
>> OK, we double faulted. Too bad that x86 CPUs don't tell us why.
>>
>>>>>> [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G W 4.0.0-rc3-2.gd5c547f-desktop #1
>>>>>> [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 12/13/2011
>>>>>> [242060.604883] task: ffff880103f46150 ti: ffff8801013d4000 task.ti: ffff8801013d4000
>>>>>> [242060.604885] RIP: 0010:[<ffffffff816834ad>] [<ffffffff816834ad>] page_fault+0xd/0x30
>>
>> The double fault happened during page fault processing. Could you
>> disassemble your page_fault function to find the offending
>> instruction?
>
> This one is easy:
>
> crash> disassemble page_fault
> Dump of assembler code for function page_fault:
> 0xffffffff816834a0 <+0>: data32 xchg %ax,%ax
> 0xffffffff816834a3 <+3>: data32 xchg %ax,%ax
> 0xffffffff816834a6 <+6>: data32 xchg %ax,%ax
> 0xffffffff816834a9 <+9>: sub $0x78,%rsp
> 0xffffffff816834ad <+13>: callq 0xffffffff81683620 <error_entry>

The callq was the double-faulting instruction, and it is indeed the
first instruction in here that would have accessed the stack. (The sub
*changes* rsp but isn't a memory access.) So, since RSP is bogus, we
page fault, and the page fault is promoted to a double fault. The
surprising thing is that the page fault itself seems to have been
delivered okay, and RSP wasn't on a page boundary.

You wouldn't happen to be using a Broadwell machine?

The only way to get here with bogus RSP is if we interrupted something
that was previously running at CPL0 with similarly bogus RSP.

I don't know if I trust CR2. It's 16 bytes lower than I'd expect.

> 0xffffffff816834b2 <+18>: mov %rsp,%rdi
> 0xffffffff816834b5 <+21>: mov 0x78(%rsp),%rsi
> 0xffffffff816834ba <+26>: movq $0xffffffffffffffff,0x78(%rsp)
> 0xffffffff816834c3 <+35>: callq 0xffffffff810504e0 <do_page_fault>
> 0xffffffff816834c8 <+40>: jmpq 0xffffffff816836d0 <error_exit>
> End of assembler dump.
>
>
>>>>>> [242060.604893] RSP: 0018:00007fffa55eafb8 EFLAGS: 00010016
>>
>> Uh, what? That RSP is a user address.
>>
>>>>>> [242060.604895] RAX: 000000000000aa40 RBX: 0000000000000001 RCX: ffffffff81682237
>>>>>> [242060.604896] RDX: 000000000000aa40 RSI: 0000000000000000 RDI: 00007fffa55eb078
>>>>>> [242060.604898] RBP: 00007fffa55f1c1c R08: 0000000000000008 R09: 0000000000000000
>>>>>> [242060.604900] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000004a
>>>>>> [242060.604902] R13: 00007ffa356b5d60 R14: 000000000000000f R15: 00007ffa3556cf20
>>>>>> [242060.604904] FS: 00007ffa33dbfa80(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000
>>>>>> [242060.604906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [242060.604908] CR2: 00007fffa55eafa8 CR3: 0000000002d7e000 CR4: 00000000000427e0
>>>>>> [242060.604909] Stack:
>>>>>> [242060.604942] BUG: unable to handle kernel paging request at 00007fffa55eafb8
>>>>>> [242060.604995] IP: [<ffffffff81005b44>] show_stack_log_lvl+0x124/0x190
>>
>> This is suspicious. We must have died, again, of a fatal page
>> fault while dumping the stack.
>
> I posted the same problem to the opensuse kernel list shortly before turning
> to LKML. There, Michal Kubecek noted:
>
> "I encountered a similar problem recently. The thing is, x86
> specification says that on a double fault, RIP and RSP registers are
> undefined, i.e. you not only can't expect them to contain values
> corresponding to the first or second fault but you can't even expect
> them to have any usable values at all. Unfortunately the kernel double
> fault handler doesn't take this into account and does try to display
> usual crash related information so that it itself does usually crash
> when trying to show stack content (that's the show_stack_log_lvl()
> crash).

I think that's not entirely true. RIP is reliable for many classes of
double faults, and we rely on that for espfix64. The fact that hpa
was willing to write that code strongly suggests that Intel chips, at
least, really do work that way.

>
> The result is a double fault (which itself would be very hard to debug)
> followed by a crash in its handler so that analysing the outcome is
> extremely difficult."
>
> I cannot judge whether this is true, but to me it sounded relevant to
> solving this problem.


The crash in the handler is a separate bug.



>
>>>>>> [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
>>>>>> [242060.605078] Oops: 0000 [#1] PREEMPT SMP
>>>>>> [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg xfs libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm_oss snd_pcm
>>>>>> [242060.605396] dm_mod snd_seq snd_seq_device snd_timer coretemp kvm_intel kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 uvcvideo videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core btusb v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 pcspkr e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei cfg80211 shpchp wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq i915 xhci_pci xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video button processor sg loop
>>>>>> [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G W 4.0.0-rc3-2.gd5c547f-desktop #1
>>>>>> [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 12/13/2011
>>>>>> [242060.605396] task: ffff880103f46150 ti: ffff8801013d4000 task.ti: ffff8801013d4000
>>>>>> [242060.605396] RIP: 0010:[<ffffffff81005b44>] [<ffffffff81005b44>] show_stack_log_lvl+0x124/0x190
>>>>>> [242060.605396] RSP: 0018:ffff88023bc84e88 EFLAGS: 00010046
>>>>>> [242060.605396] RAX: 00007fffa55eafc0 RBX: 00007fffa55eafb8 RCX: ffff88023bc7ffc0
>>>>>> [242060.605396] RDX: 0000000000000000 RSI: ffff88023bc84f58 RDI: 0000000000000000
>>>>>> [242060.605396] RBP: ffff88023bc83fc0 R08: ffffffff81a2fe15 R09: 0000000000000020
>>>>>> [242060.605396] R10: 0000000000000afb R11: ffff88023bc84bee R12: ffff88023bc84f58
>>>>>> [242060.605396] R13: 0000000000000000 R14: ffffffff81a2fe15 R15: 0000000000000000
>>>>>> [242060.605396] FS: 00007ffa33dbfa80(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000
>>>>>> [242060.605396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [242060.605396] CR2: 00007fffa55eafb8 CR3: 0000000002d7e000 CR4: 00000000000427e0
>>>>>> [242060.605396] Stack:
>>>>>> [242060.605396] 0000000002d7e000 0000000000000008 ffff88023bc84ee8 00007fffa55eafb8
>>>>>> [242060.605396] 0000000000000000 ffff88023bc84f58 00007fffa55eafb8 0000000000000040
>>>>>> [242060.605396] 00007ffa356b5d60 000000000000000f 00007ffa3556cf20 ffffffff81005c36
>>>>>> [242060.605396] Call Trace:
>>>>>> [242060.605396] [<ffffffff81005c36>] show_regs+0x86/0x210
>>>>>> [242060.605396] [<ffffffff8104636f>] df_debug+0x1f/0x30
>>>>>> [242060.605396] [<ffffffff810041a4>] do_double_fault+0x84/0x100
>>>>>> [242060.605396] [<ffffffff81683088>] double_fault+0x28/0x30
>>>>>> [242060.605396] [<ffffffff816834ad>] page_fault+0xd/0x30
>>>>>> [242060.605396] Code: fe a2 81 31 c0 89 54 24 08 48 89 0c 24 48 8b 5b f8 e8 cc 06 67 00 48 8b 0c 24 8b 54 24 08 85 d2 74 05 f6 c2 03 74 48 48 8d 43 08 <48> 8b 33 48 c7 c7 0d fe a2 81 89 54 24 14 48 89 4c 24 08 48 89
>>>>>> [242060.605396] RIP [<ffffffff81005b44>] show_stack_log_lvl+0x124/0x190
>>>>>> [242060.605396] RSP <ffff88023bc84e88>
>>>>>> [242060.605396] CR2: 00007fffa55eafb8
>>>>>>
>>>>>> I would not totally rule out a hardware problem, since this machine had
>>>>>> another weird incident where it crashed and the BIOS beeper stayed on
>>>>>> constantly until I held the power button for 5 seconds.
>>>>>>
>>>>>> Unfortunately, I cannot load the crashdump with the crash version in
>>>>>> openSUSE Tumbleweed, so the backtrace is all I have for now.
>>>>>
>>>>> Just "me too", I'm getting the very same crash out of sudden with the
>>>>> recent 4.0-rc. Judging from the very same pattern (usually crash
>>>>> happened while using KVM (-smp 4) and kernel builds with -j8), I don't
>>>>> think it's a hardware problem.
>>>>
>>>> The git bisection pointed to the commit:
>>>> commit b926e6f61a26036ee9eabe6761483954d481ad25
>>>> x86, traps: Fix ist_enter from userspace
>>>>
>>>> And reverting this on top of the latest Linus tree seems to work.
>>>> Seife, could you verify on your machine, too?
>>>
>>> Argh, false positive. Right after I wrote this mail, I got the very
>>> same crash. I seem to need to run the test much longer than I
>>> thought.
>>>
>>> But somehow the commits around the above smell suspicious...
>>>
>>
>> Those commits shouldn't really have affected page fault or double
>> fault behavior. They made big changes to MCE, breakpoints, and debug
>> exceptions.
>>
>> Something's very wrong here. I'm guessing that we somehow ended up in
>> page_fault in a completely invalid context.
>>
>> One hairy code path that could plausibly do this is:
>>
>> 1. syscall
>>
>> 2. vmalloc fault accessing old_rsp aka rsp_scratch (or kernel_stack --
>> same issue)
>>
>> 3. page fault. Now we're on the user stack and we do awful things.
>> If we run off the end of the presently writable portion of the stack,
>> we double fault.
>
> Maybe Michal's idea from above points in the right direction?
>
> Now, since I have a crash dump and the corresponding debuginfo at hand,
> might this somehow help to find out where the problem originated?
>
> I mean -- it's only 508 processes to look at :-) but if I knew what to
> look for in their backtraces, I would try to do it.

The relevant thread's stack is here (see ti in the trace):

ffff8801013d4000

It could be interesting to see what's there.

I don't suppose you want to try to walk the paging structures to see
if ffff88023bc80000 (i.e. gsbase) and, more specifically,
ffff88023bc80000 + old_rsp and ffff88023bc80000 + kernel_stack are
present? You'd only have to walk one level -- presumably, if the PGD
entry is there, the rest of the entries are okay, too.
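If it helps, here is a trivial userspace helper (assuming the standard
4-level, 4 KiB-page x86_64 layout; the function name is made up) that prints
which slot each paging level would use for a given virtual address. gsbase,
gsbase + old_rsp and gsbase + kernel_stack differ only in their low bits, so
they all land in the same PGD slot, which is why one level should be enough.

#include <stdio.h>
#include <stdint.h>

/* Print the PGD/PUD/PMD/PTE indices of a virtual address, assuming
 * 4-level paging with 4 KiB pages (9 index bits per level). */
static void pt_indices(uint64_t vaddr)
{
	printf("%#018llx: pgd %3llu  pud %3llu  pmd %3llu  pte %3llu\n",
	       (unsigned long long)vaddr,
	       (unsigned long long)((vaddr >> 39) & 0x1ff),
	       (unsigned long long)((vaddr >> 30) & 0x1ff),
	       (unsigned long long)((vaddr >> 21) & 0x1ff),
	       (unsigned long long)((vaddr >> 12) & 0x1ff));
}

int main(void)
{
	pt_indices(0xffff88023bc80000ULL);	/* gsbase from the oops */
	pt_indices(0xffff8801013d4000ULL);	/* ti / thread stack above */
	return 0;
}

Page-table entries are 8 bytes, so once you have the PGD index, the entry sits
at physical (CR3 & ~0xfff) + 8 * index -- CR3 is 0x2d7e000 in the dump above.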


--Andy