Re: [PATCH 4/5] x86-64: Replace vsyscall gettimeofday fallbackwith int 0xcc

From: Borislav Petkov
Date: Sun May 29 2011 - 16:26:54 EST


I would welcome it very much if this explanation landed somewhere into
<Documentation/x86/x86_64/> for all those of us who find ourselves
staring at entry code now and then :).

Thanks.

On Sun, May 29, 2011 at 09:10:55PM +0200, Ingo Molnar wrote:
>
> * Andy Lutomirski <luto@xxxxxxx> wrote:
>
> > --- a/arch/x86/kernel/entry_64.S
> > +++ b/arch/x86/kernel/entry_64.S
> > @@ -1121,6 +1121,8 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
> > zeroentry coprocessor_error do_coprocessor_error
> > errorentry alignment_check do_alignment_check
> > zeroentry simd_coprocessor_error do_simd_coprocessor_error
> > +zeroentry intcc do_intcc
> > +
> >
> > /* Reload gs selector with exception handling */
> > /* edi: new selector */
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>
> I forgot to reply to your prior question about zeroentry vs.
> paranoidzeroentry.
>
> That distinction is an undocumented x86-64-ism.
>
> Background:
>
> The SWAPGS instruction is rather fragile: it must nest perfectly and
> only in single depth, it should only be used if entering from user
> mode to kernel mode and then when returning to user-space, and
> precisely so. If we mess that up even slightly, we crash.
>
> So when we have a secondary entry, already in kernel mode, we *must
> not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
> not switched/swapped yet.
>
> Now, there's a secondary complication: there's a cheap way to test
> which mode the CPU is in and an expensive way.
>
> The cheap way is to pick this info off the entry frame on the kernel
> stack, from the CS of the ptregs area of the kernel stack:
>
> xorl %ebx,%ebx
> testl $3,CS+8(%rsp)
> je error_kernelspace
> SWAPGS
>
> The expensive (paranoid) way is to read back the MSR_GS_BASE value
> (which is what SWAPGS modifies):
>
> movl $1,%ebx
> movl $MSR_GS_BASE,%ecx
> rdmsr
> testl %edx,%edx
> js 1f /* negative -> in kernel */
> SWAPGS
> xorl %ebx,%ebx
> 1: ret
>
>
> and the whole paranoid non-paranoid macro complexity is about whether
> to suffer that RDMSR cost.
>
> If we are at an interrupt or user-trap/gate-alike boundary then we
> can use the faster check: the stack will be a reliable indicator of
> whether SWAPGS was already done: if we see that we are a secondary
> entry interrupting kernel mode execution, then we know that the GS
> base has already been switched. If it says that we interrupted
> user-space execution then we must do the SWAPGS.
>
> But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry
> context, which might have triggered right after a normal entry wrote
> CS to the stack but before we executed SWAPGS, then the only safe way
> to check for GS is the slower method: the RDMSR.
>
> So we try only to mark those entry methods 'paranoid' that absolutely
> need the more expensive check for the GS base - and we generate all
> 'normal' entry points with the regular (faster) entry macros.
>
> I hope this explains!
>
> All in one, your zeroentry choice should be fine: INT 0xCC will not
> issue in NMI context.
>
> Btw, as a sidenote, and since you are already touching this code,
> would you be interested in putting this explanation into the source
> code? It's certainly not obvious and whoever wrote those macros did
> not think of documenting them for later generations ;-)
>
> > +++ b/arch/x86/kernel/traps.c
> > @@ -872,6 +872,10 @@ void __init trap_init(void)
> > set_bit(SYSCALL_VECTOR, used_vectors);
> > #endif
> >
> > + set_system_intr_gate(0xCC, &intcc);
> > + set_bit(0xCC, used_vectors);
> > + printk(KERN_ERR "intcc gate isntalled\n");
>
> I think you mentioned it but i cannot remember your reasoning why you
> marked it 0xcc (and not closer to the existing syscall vector) -
> please add a comment about it into the source code as well.
>
> Ok, i suspect you marked it 0xCC because that's the INT3 instruction
> - not very useful for exploits?
>
> > +void dotraplinkage do_intcc(struct pt_regs *regs, long error_code)
> > +{
> > + /* Kernel code must never get here. */
> > + if (!user_mode(regs))
> > + BUG();
>
> Nit: you can use BUG_ON() for that.
>
> > + local_irq_enable();
> > +
> > + if (!in_vsyscall_page(regs->ip)) {
> > + struct task_struct *tsk = current;
> > + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
>
> Nit: please put an empty new line between local variable definitions
> and the first statement that follows - we do this for visual clarity.
>
> A not-so-nit: i'd not limit this message to unhandled signals alone.
> An attacker could install a SIGSEGV handler, send a SIGSEGV and
> attempt the exploit right then - he'll get a free attempt with no
> logging performed, right?.
>
> > + printk_ratelimit()) {
> > + printk(KERN_INFO
> > + "%s[%d] illegal int $0xCC ip:%lx sp:%lx",
> > + tsk->comm, task_pid_nr(tsk),
> > + regs->ip, regs->sp);
>
> I'd suggest putting the text 'exploit attempt?' into the printk
> somewhere - a sysadmin might not necessarily know what an illegal int
> $0xCC is..
>
> > + print_vma_addr(" in ", regs->ip);
> > + printk("\n");
> > + }
> > +
> > + force_sig(SIGSEGV, current);
> > + return;
> > + }
> > +
> > + if (current->seccomp.mode) {
> > + do_exit(SIGKILL);
> > + return;
> > + }
> > +
> > + regs->ax = sys_gettimeofday((struct timeval __user *)regs->di, NULL);
>
> Does the vsyscall gettimeofday ignore the zone parameter too?
>
> > +
> > + local_irq_disable();
> > + return;
> > +}
>
> Nit: no need for a 'return;' at the end of a void function.
>
> Thanks,
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/