Re: ESP corruption bug - what CPUs are affected?

From: Petr Vandrovec
Date: Sat Sep 18 2004 - 15:39:12 EST

Next message: Kenichi Okuyama: "Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal"
Previous message: Marc Ballarin: "Re: udev is too slow creating devices"
In reply to: Stas Sergeev: "Re: ESP corruption bug - what CPUs are affected?"
Next in thread: Stas Sergeev: "Re: ESP corruption bug - what CPUs are affected?"

On Sat, Sep 18, 2004 at 11:14:44PM +0400, Stas Sergeev wrote:
> Hi,
>
> Petr Vandrovec wrote:
> >>Does this look reasonable? If it does, I think I
> >>should just start implementing that.
> >Do not forget that you have to implement also return to CPL1, as
> >NMI may arrive while you are running on CPL1. So it may not be
> >as trivial as it seemed.
> I am not sure what special actions have to be
> taken here compared to returning to ring-3 from NMI.
> Is there anywhere in the sources an example to take
> a look at? (sorry for the newbie questions)

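It means that you cannot blindly create the CPL1 trampoline stack
in some static per-cpu area: an NMI may interrupt the trampoline
itself, and rebuilding the frame would overwrite one that is still
in use. But if we can assume that there is no other CPL1 code in
the system, something like the code below could work: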

/*
 * + 20 [word 5] SS
 * + 16 [word 4] ESP
 * + 12 [word 3] EFLAGS
 * +  8 [word 2] CS
 * +  4 [word 1] EIP
 * +  0 [word 0] ESP for popl %esp
 */
u_int32_t cpl1stacks[NUM_CPUS][6];

curCPU = smp_processor_id();
minSP = curCPU * sizeof(cpl1stacks[0]);
maxSP = minSP + sizeof(cpl1stacks[0]);
cpl1stack = cpl1stacks[curCPU];

if ((cpl0stack[retCS] & 3) == 1) {
    /* Going back to our trampoline */
    /* There is no other place in the kernel running on CPL1
       except our trampoline, so the interrupt could occur either
       on popl %esp or on iret. If it occurred on popl %esp,
       just return; the code will do the proper thing. If the
       interrupt occurred on iret, we have to perform popl %esp
       again, so that the upper bits of %esp are correctly
       restored for CPL3 code */
    ASSERT(cpl0stack[retCS] == FLAT_4G_CPL1_CS);
    ASSERT(cpl0stack[retSS] == SMALL_CPL1_SS);
    if (cpl0stack[retEIP] == fixup_proc) {
        ASSERT(cpl0stack[retESP] == minSP);
    } else if (cpl0stack[retEIP] == fixup_proc_iret) {
        ASSERT((cpl0stack[retESP] & 0xFFFF) == minSP + 4);
        /* Undo popl %esp - copy value from ESP we were
           using on CPL1 back to stack */
        cpl1stack[0] = cpl0stack[retESP];
        cpl0stack[retEIP] = fixup_proc;
        cpl0stack[retESP] = minSP;
    } else {
        /* unexpected code running on CPL1 */
        /* Probably do simple IRET and hope for the best? */
        ASSERT(0);
    }
    iret;
} else {
    cpl1stack[5] = cpl0stack[retSS];
    cpl1stack[4] = cpl0stack[retESP];
    cpl1stack[3] = cpl0stack[retEFLAGS];
    cpl1stack[2] = cpl0stack[retCS];
    cpl1stack[1] = cpl0stack[retEIP];
    cpl1stack[0] = (cpl0stack[retESP] & 0xFFFF0000) | (minSP + 4);
    cpl0stack[retSS] = SMALL_CPL1_SS;
    cpl0stack[retESP] = minSP;
    /*
       Do NOT clear IF. The IF flag is affected only if IOPL >= CPL, so
       with IOPL=0 an IRET on CPL1 won't reenable interrupts. This is the
       reason why we cannot use RETF to return from CPL0 to CPL1 (retf
       is much faster than iret on P4), and we cannot use retf for
       CPL1->CPL3 due to TF/RF.

       Clear TF so we do not start tracing on CPL1 if we trace
       userspace, and clear RF, so that if somebody intentionally pointed
       a hardware breakpoint at the CPL1 handler, it will be triggered
       (we must use this path for returns from INT1 too, so it
       is possible that RF is set in EFLAGS on the stack).
    */
    cpl0stack[retEFLAGS] &= ~(EFLAGS_TF | EFLAGS_RF);
    cpl0stack[retCS] = FLAT_4G_CPL1_CS;
    cpl0stack[retEIP] = fixup_proc;
    iret;
}

fixup_proc:
    popl %esp
fixup_proc_iret:
    iret

It assumes one new 32-bit flat CPL1 CS descriptor in the GDT and one
16-bit (small) CPL1 SS descriptor (expand-up, with limit 24*<num_cpus>
and base at cpl1stacks), plus <num_cpus> 24-byte CPL1 stacks. A 16-bit
stack segment covers at most 64KB, so kernels with more than 2730 CPUs
need more than one stack descriptor.

> >Maybe all these programs survive that
> >their CPL3 stack changes,
> Most likely they will, I am just not sure. What
> if they disabled interrupts and are switching the
> stack by loading the SS and ESP separately? If we
> interrupt it there, there may be the problems, which
> would be almost impossible to track down later.
> It just looks a bit unsafe to me. Or maybe exploit
> a sigaltstack for that? Hmm, is implementing the
> CPL1 trampoline really that difficult after all?
> I think it is somewhat cleaner and maybe safer.

Since the pseudocode above handles even NMIs with just 24 bytes of
CPL1 stack, it is definitely the preferred solution.

> >and AFAIK LAR is microcoded on P4.
> Where does this lead us to? Some other problems I
> am not aware about?

It is slow. No other problems, except that doing
(((ss & 4) ? ldt[ss >> 3] : gdt[ss >> 3]) & 0x00????00) == 0x00????00
may be faster than doing (lar(ss) & 0x00????00) == 0x00????00.
Petr

