Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Andy Lutomirski
Date: Fri May 08 2015 - 08:57:19 EST


On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
>> On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@xxxxxxxxxx> wrote:
>> >
>> >
>> > * Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> >
>> > > > So do you mean:
>> > > >
>> > > > this_cpu_set(rcu_state) = IN_KERNEL;
>> > > > ...
>> > > > this_cpu_inc(rcu_qs_ctr);
>> > > > this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > > > ?
>> > > >
>> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
>> > > > it just two simple stores into a byte flag, one on entry and one on
>> > > > exit:
>> > > >
>> > > > this_cpu_set(rcu_state) = IN_KERNEL;
>> > > > ...
>> > > > this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > >
>> > > I was thinking that either a counter or a state flag could make sense.
>> > > Doing both would be pointless. The counter could use the low bits to
>> > > indicate the state. The benefit of the counter would be that the
>> > > RCU-waiting CPU could observe that the counter has incremented and
>> > > that therefore a grace period has elapsed. Getting it right would
>> > > require lots of care.
>> >
>> > So if you mean:
>> >
>> > <syscall entry>
>> > ...
>> > this_cpu_inc(rcu_qs_ctr);
>> > <syscall exit>
>> >
>> > I don't see how this would work reliably: how do you handle the case
>> > of a SCHED_FIFO task never returning from user-space (common technique
>> > in RT apps)? synchronize_rcu() would block indefinitely as it would
>> > never see rcu_qs_ctr increase.
>> >
>> > We have to be able to observe user-mode anyway, for system-time
>> > statistics purposes, and that flag could IMHO also drive the RCU GC
>> > machinery.
>>
>> The counter would have to be fancier than that to work. We could
>> say that an even value means user mode, for example. IOW the high
>> bits of the counter would count transitions to quiescent and the low
>> bits would indicate the state.
>>
>> Actually, I'd count backwards. We could use an andl instruction to
>> go to user mode and a decl to go to kernel mode.
>
> at which point it's not really a monotonic quiescent state counter -
> which your naming suggested before.
>
> Also it would have to be done both at entry and at exit.
>
> But yes, if you replace a state flag with a recursive state flag that
> is INC/DEC maintained that would work as well. That's not a counter
> though.

I won't quibble about the name :)

>
>> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
>> > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
>> > > because stores can be delayed. (It could even happen after
>> > > sysret, IIRC, but IRET is serializing.)
>> >
>> > All it has to do in the synchronize_rcu() slowpath is something
>> > like:
>>
>> I don't think this works. Suppose rt_cpu starts in kernel mode.
>> Then it checks ti flags.
>>
>> >
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> > smp_mb__before_atomic();
>> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> > smp_rmb();
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>>
>> Now it sets rcu_state and exits to user mode. It never notices
>> TIF_RCU_NOTIFY.
>
> Indeed, you are right, that's racy.
>
>> [...] We could fix this by sending an IPI to kick it.
>
> With an IPI we won't need any flags or counters on the RT CPU - it's
> the IPI we are trying to avoid.
>
> So how about the following way to fix the race: simply do a poll loop
> with a fixed timeout sleep: like it would do anyway if
> synchronize_rcu() was waiting for the timer irq to end the grace
> period on the RT-CPU.

Seems reasonable to me.

>
> The TIF flag would cause an RCU grace period to lapse, no need to wake
> up the synchronize_rcu() side: it will notice the (regular) increased
> RCU counter.
>
>> > We are talking about a dozen cycles, while a typical
>> > synchronize_rcu() will wait millions (sometimes billions) of
>> > cycles. There's absolutely zero performance concern here and it's
>> > all CPU local in any case.
>>
>> The barrier would affect all syscalls, though. [...]
>
> What barrier? I never suggested any barrier in the syscall fast path,
> this bit:

Oh, I misunderstood.

>
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> > smp_mb__before_atomic();
>> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> > smp_rmb();
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> runs inside synchronize_rcu() or so.
>
> But none of that is needed if we do:
>
> - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context
> entry/exit time, straight in the assembly
>
> - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
> nothing but: this_cpu_inc(rcu_qs_ctr).
>
> - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
> then does a timeout-poll-loop with jiffies granular
> timeouts, simulating a timer IRQ in essence. It will either
> observe IN_USER or will see the regular RCU qs counter
> increase.
>
> this should be race free.
>
> Alternatively we can even get rid of the TIF flag by merging the
> percpu counter with the percpu state. Quiescent states are recorded
> via a counter shifted by two bits:
>
> rcu_qs_ctr += 4;
>
> while user/kernel/idle mode is recorded in the lower 2 bits.
>
> So on entering the kernel we'd do:
>
> rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */
>
> on returning to user-space we'd do:
>
> rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */
>
> on going idle we'd do:
>
> rcu_qs_ctr += 4+2; /* Register QS, set bit 1 */
>
> on return from idle we'd do:
>
> rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
>
> etc. On all boundary transitions we can use a constant ADD on a
> suitable percpu variable.

Sounds good to me, except that we need to be careful to distinguish
between non-syscall entries from quiescent states and non-syscall
entries from quiescent states. We could save the old state (as the
current exception_enter code does) or we could allocate enough low
bits for the state that the problem goes away.

I don't think the TIF_RCU_QS variant is worthwhile -- merging the
counter and state is probably both easier and faster.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/