Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

From: Takashi Iwai
Date: Mon Mar 23 2015 - 14:56:52 EST


At Mon, 23 Mar 2015 11:38:30 -0700,
Andy Lutomirski wrote:
>
> On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
> > On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> >> At Mon, 23 Mar 2015 10:35:41 +0100,
> >> Takashi Iwai wrote:
> >>>
> >>> At Mon, 23 Mar 2015 10:02:52 +0100,
> >>> Takashi Iwai wrote:
> >>>>
> >>>> At Fri, 20 Mar 2015 19:16:53 +0100,
> >>>> Denys Vlasenko wrote:
>
> >> I'm really puzzled now. We have a few pieces of information:
> >>
> >> - git bisection pointed the commit 96b6352c1271:
> >> x86_64, entry: Remove the syscall exit audit and schedule optimizations
> >> and reverting this "fixes" the problem indeed. Even just moving two
> >> lines
> >> LOCKDEP_SYS_EXIT
> >> DISABLE_INTERRUPTS(CLBR_NONE)
> >> at the beginning of ret_from_sys_call already fixes it. (Of course I
> >> can't prove the fix, but it has run stable for a day without a crash,
> >> while I usually hit the bug within 10 minutes of a full test run.)
> >
> > The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
> > interrupt-disabled region to interrupt-enabled:
> >
> > cmpq $__NR_syscall_max,%rax
> > ja ret_from_sys_call
> > movq %r10,%rcx
> > call *sys_call_table(,%rax,8) # XXX: rip relative
> > movq %rax,RAX-ARGOFFSET(%rsp)
> > ret_from_sys_call:
> > testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> > LOCKDEP_SYS_EXIT
> > DISABLE_INTERRUPTS(CLBR_NONE)
> > TRACE_IRQS_OFF
> > ...
> > ...
> > int_ret_from_sys_call_fixup:
> > FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> > jmp int_ret_from_sys_call
> > ...
> > ...
> > GLOBAL(int_ret_from_sys_call)
> > DISABLE_INTERRUPTS(CLBR_NONE)
> > TRACE_IRQS_OFF
> >
> > You reverted that by moving this insn to after the first DISABLE_INTERRUPTS(CLBR_NONE).
> >
> > I also don't see how moving that check (even if it is wrong in a more
> > benign way) can have such a drastic effect.
>
> I bet I see it. I have the advantage of having stared at KVM code and
> cursed at it more recently than you, I suspect. KVM does awful, awful
> things to CPU state, and, as an optimization, it allows kernel code to
> run with CPU state that would be totally invalid in user mode. This
> happens through a bunch of hooks, including this bit in __switch_to:
>
> /*
> * Now maybe reload the debug registers and handle I/O bitmaps
> */
> if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
> task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> __switch_to_xtra(prev_p, next_p, tss);
>
> IOW, we *change* tif during context switches.
>
>
> The race looks like this:
>
> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>
> --- preempted here, switch to KVM guest ---
>
> KVM guest enters and screws up, say, MSR_SYSCALL_MASK. This wouldn't
> happen to be a *32-bit* KVM guest, perhaps?
>
> Now KVM schedules, calling __switch_to. __switch_to sets
> _TIF_USER_RETURN_NOTIFY. We IRET back to the syscall exit code, turn
> off interrupts, and do sysret. We are now screwed.

Thanks for the enlightenment! That looks like a feasible scenario.
(I tested only a 64-bit KVM guest, BTW.)

> I don't know why this manifests in this particular failure, but any
> number of terrible things could happen now.
>
> FWIW, this will affect things other than KVM. For example, SIGKILL
> sent while a process is sleeping in that two-instruction window won't
> work.
>
> Takashi, can you re-send your patch so we can review it for real in
> light of this race?

The patch below worked. I'll double-check tomorrow whether it really
cures the problem reliably.


thanks,

Takashi

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d161687c..5340ac7f88a9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -364,12 +364,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jnz int_ret_from_sys_call_fixup /* Go the the slow path */
-
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup /* Go the the slow path */
+
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts: