Re: [x86/asm/entry] BUG: unable to handle kernel paging request

From: Brian Gerst
Date: Mon Mar 09 2015 - 08:57:12 EST


On Sun, Mar 8, 2015 at 3:13 PM, Denys Vlasenko <vda.linux@xxxxxxxxxxxxxx> wrote:
> On Sat, Mar 7, 2015 at 1:33 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Fri, Mar 6, 2015 at 3:30 PM, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
>>> Greetings,
>>>
>>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm
>>>
>>> commit 75182b1632a89f12540baa1806a7c5c180db620c
>>> Author: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>>> AuthorDate: Thu Mar 5 19:19:03 2015 -0800
>>> Commit: Ingo Molnar <mingo@xxxxxxxxxx>
>>> CommitDate: Fri Mar 6 08:32:57 2015 +0100
>>>
>>> x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0()
>>
>> This problem only happens on 32-bit kernels, and the culprit is, sort of:
>>
>> /*
>> * The below -8 is to reserve 8 bytes on top of the ring0 stack.
>> * This is necessary to guarantee that the entire "struct pt_regs"
>> * is accessible even if the CPU haven't stored the SS/ESP registers
>> * on the stack (interrupt gate does not save these registers
>> * when switching to the same priv ring).
>> * Therefore beware: accessing the ss/esp fields of the
>> * "struct pt_regs" is possible, but they may contain the
>> * completely wrong values.
>> */
>> #define task_pt_regs(task) \
>> ({ \
>> struct pt_regs *__regs__; \
>> __regs__ = (struct pt_regs *)(KSTK_TOP(task_stack_page(task))-8); \
>> __regs__ - 1; \
>> })
>>
>> I'm confused about multiple things:
>>
>> 1. I don't understand this comment.
>
> Comment says that in 32-bit x86, interrupts and exceptions
> in ring 0 do not push SS,ESP - they only save EFLAGS,CS,EIP
> in iret frame. (This happens because CPL doesn't
> change, not beacuse ot is zero).
>
> IRET insn likewise does not restore SS,ESP if it detects
> that RPL(stack_CS) = RPL(CS).
>
> This was changed for 64-bit more for two reasons:
> (1) it's simpler, and
> (2) it makes it possible for CPU to automatically align stack
> to 16-byte boundary (you need to undo that on iretq,
> therefore you need to always save old RSP).
>
> From AMD docs:
>
> "8.9.3 Interrupt Stack Frame
> In long mode, the return-program stack pointer (SS:RSP) is always
> pushed onto the interrupt-handler
> stack, regardless of whether or not a privilege change occurs.
> Although the SS register is not used in
> 64-bit mode, SS is pushed to allow returns into compatibility mode.
> Pushing SS:RSP unconditionally
> presents operating systems with a consistent interrupt-stack-frame
> size for all interrupts, except for
> error codes. Interrupt service-routine entry points that handle
> interrupts generated by non-error-code
> interrupts can push an error code on the stack for consistency.
> In long mode, when a control transfer to an interrupt handler occurs,
> the processor performs the
> following:
> 1. Aligns the new interrupt-stack frame by masking RSP with
> FFFF_FFFF_FFFF_FFF0h.
> ..."

I believe kernel threads used to run with the stack pointer starting
right at the top of the stack. So if a kernel thread was interrupted
then regs->ss and regs->esp might be above the stack in the next page
(which could possibly fault). However, any code that looks at those
two values on 32-bit without checking regs->cs first is buggy. Even
if they can be read they are not valid if there was no CPL change.

It appears that kernel threads now have an empty pt_regs struct on the
top of the stack, which is used for execve. That would mean there is
sufficient buffer for accessing regs->esp and regs->ss without
crossing the top of the stack. The 8 byte offset is not needed
anymore.

>> 2. copy_thread thinks that sp0 should be task_pt_regs + 1, which is 8
>> bytes below the top of the stack. cpu_tss, formerly INIT_TSS, thinks
>> that sp0 should be the top of the stack. This is inconsistent. What
>> gives?
>
> I ran a small test: made "umask" syscall print where pt_regs is,
> and where it's end is:
>
> SYSCALL_DEFINE1(umask, int, mask)
> {
> struct pt_regs *regs = task_pt_regs(current);
> unsigned long ptregs = (unsigned long)regs;
>
> pr_err("[%d] current:%lx pt_regs:%lx pt_regs_end:%lx\n",
> current->pid, (long)current, ptregs, ptregs + sizeof(*regs));
> ....
>
> Result:
>
> / # umask
> [ 8.393450] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8
> [ 8.396601] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8
>
> It seems to be true that 32-bit x86 always has two unused words
> beyond "struct pt_regs" on stack during syscalls.
>
> IOW: tss.sp0 is *not* set at the very end of the stack on 32-bit.
>
>> 3. vm86 does something terrible to sp0, which might mean that we can't
>> use sp0 to find the top of the stack in general on 32-bit kernels,
>> which is sad. We might need a partial revert or to introduce a
>> per-cpu variable "real_sp0" on 32-bit. Ugh. Could someone who
>> understands vm86 mode better than me give me a hint?

vm86 mode alters sp0 in order to save the 32-bit pt_regs on the top of
the stack. The iret frame for vm86 mode includes ds/es/fs/gs on the
stack before the normal iret frame. This is so that the segments can
be set for real mode semantics. So the normal pt_regs would not be
large enough to include those extra segments in the ire frame.

I had started work on a patch to clean this up, by saving the 32-bit
regs and other data off the stack, but never finished it. I may look
at it again soon.

>> 4. Not directly related, but the 32-bit tss_struct contains this gem:
>>
>> /*
>> * .. and then another 0x100 bytes for the emergency kernel stack:
>> */
>> unsigned long stack[64];
>>
>> Last I checked, 0x100 != 64. Also, wow, this is kind of disgusting. :)
>
>
> Seems to be unused: I commented it out on "defconfig" build
> and got no build errors.

This is only for the sysenter code on a 32-bit kernel. It should be
commented out for a 64-bit kernel.

--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/