Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Andy Lutomirski
Date: Sun May 03 2015 - 14:41:58 EST


On Fri, May 1, 2015 at 10:27 PM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
>> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <riel@xxxxxxxxxx> wrote:
>> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>> >
>> >> Or we could do that in the syscall path with a single store of a
>> >> constant flag to a location in the task struct. We have a number of
>> >> natural flags that get written on syscall entry, such as:
>> >>
>> >> pushq_cfi $__USER_DS /* pt_regs->ss */
>> >>
>> >> That goes to a constant location on the kernel stack. On return from
>> >> system calls we could write 0 to that location.
>>
>> Huh? IRET with zero there will fault, and we can't guarantee that
>> all syscalls return using sysret. [...]
>
> So IRET is a slowpath - I was thinking about the fastpath mainly.

Slowpath or not, if we're using part of the hardware frame in pt_regs
as an indication of whether we're in user or kernel mode, we don't get
to arbitrarily change it to the "user" state before IRET or the IRET
will fail. We could work around that with some kind of trampoline,
but that's complicated.

>
>> [...] Also, we'd have to audit all the entries, and
>> system_call_after_swapgs currently enables interrupts early enough
>> that an interrupt before all the pushes will do unpredictable things
>> to pt_regs.
>
> An irq hardware frame won't push zero to that selector value, right?
> That's the only bad thing that would confuse the code.
>

I think it's not quite that simple. The syscall entry looks like, roughly:

fix rsp;
sti;
push ss;
push rsp;
push flags;
push cs;
push rip;

We can get an interrupt part-way through those pushes. Maybe there's
no bad place where we could get an IRQ since SS is first, but this is
still nasty.

>
> But yes, safely accessing the remote task is a complication, but with
> such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So
> even if we do:
>
>> If we went that route, I'd advocate sticking the flag in tss->sp1.
>> That cacheline is unconditionally read on kernel entry already, and
>> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
>> lost track). [1]
>>
>> Alternatively, I'd suggest that we actually add a whole new word to
>> pt_regs.
>
> ... we'd still have to set TIF_NOHZ and are back to square one in
> terms of race complexity.
>
> But it's not overly complex: by taking the remote CPU's rq-lock from
> synchronize_rcu() we could get a stable pointer to the currently
> executing task.
>
> And if we do that, we might as well use the opportunity and take a
> look at pt_regs as well - this is why sticking it into pt_regs might
> be better.
>
> So I'd:
>
> - enlarge pt_regs by a word and stick the flag there (this
> allocation is essentially free)
>
> - update the word from entry/exit
>
> - synchronize_rcu() avoids having to send an IPI by taking a
> peak at rq->curr's pt_regs::flag, and if:
>
> - the flag is 0 then it has observed a quiescent state.
>
> - the flag is 1, then it would set TIF_NOHZ and wait for a
> completion from a TIF_NOHZ callback.

This seems complicated, especially given IST entries that put their
pt_regs in different places. We get a bit of a free pass on some of
the IST entries due to the IST stack switch thing, but we still need
to worry about NMI and MCE at least.

I think I want to understand the considerations a bit better before I
express a real opinion. IIUC the remote CPU needs to reliably detect
that we've had a grace period. This can potentially happen by
observing us with some quiescent (user-mode, guest-mode or idle) flag
set or by an interrupt incrementing a counter. In CONFIG_PREEMPT_RCU
mode, the remote CPU can presumably observe that we're quiescent
directly.

Does this mean that, in CONFIG_PREEMPT_RCU mode, all of this is
unnecessary? We may still want context tracking for the timing stats
sampling, but I don't see why we need it for RCU.

In non-CONFIG_PREEMPT_RCU mode, I think we need to worry about
possible races that could cause us to fail to ever detect a grace
period. If we have the periodic tick turned off and if the remote CPU
checks our context and sees us non-quiescent, then either we need a
non-racy way to get us to do work when we become quiescent or the
remote CPU needs to eventually send us an IPI. I don't see any way to
do the former without an RMW or barrier on every transition into a
quiescent state. Even if we used ti->flags, I still think that would
require us to upgrade the cli; test on exit to userspace to a
test-and-set or test-and-clear, either of which is much more
expensive.

I think I prefer a per-cpu approach over a per-task approach, since
it's easier to reason about and it should still only require one
instruction on entry and one instruction on exit.

FWIW, if we go the per-cpu route, what if we used a counter instead of
a simple context flag? With a counter, the remote CPU could observe
that we changed state and therefore went through a grace period even
if the remote CPU never actually observes us in the quiescent state.

I'll be travelling and have very high latency for two weeks.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/