Re: [PATCH] rcu: Is it safe to enter an RCU read-side criticalsection?

From: Steven Rostedt
Date: Mon Sep 09 2013 - 11:22:07 EST

Next message: Alexander Gordeev: "[PATCH 3/9] PCI/MSI/x86: Make return values only 0/-errno when MSIsallocated"
Previous message: Oleg Nesterov: "Re: free_pid() && PIDNS_HASH_ADDING"
In reply to: Frederic Weisbecker: "Re: [PATCH] rcu: Is it safe to enter an RCU read-side criticalsection?"
Next in thread: Steven Rostedt: "Re: [PATCH] rcu: Is it safe to enter an RCU read-side criticalsection?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 9 Sep 2013 16:40:38 +0200
Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:

> On Mon, Sep 09, 2013 at 09:29:17AM -0400, Steven Rostedt wrote:
> > On Mon, 9 Sep 2013 09:21:42 -0400
> > Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> >
> > > Especially since the function name itself is "rcu_is_cpu_idle()" which
> > > tells you it's a cpu state, and not a task state. Why would you care in
> > > RCU if CPU 1 is idle if you are on CPU 2? The name should be changed.
> >
> > > Actually, preempt_count is more a CPU state than a task state. When
> > > preempt_count is set, that CPU can not schedule out the current task.
> > > It really has nothing to do with the task itself. It's the CPU that
> > > can't do something. Preempt count should never traverse with a task
> > > from one CPU to another.
> >
> > I'll take this a step further. Here's a simple rule to determine if
> > something is a task state or a CPU state.
> >
> > If the state migrates with a task from one CPU to another, it's a task
> > state.
>
> This applies to preempt_count as well.

No it does not, because the very definition of preempt_count means the
task can not migrate. And why would a task want preempt count activated?
Because it is dealing with per cpu data! Again, if we look at the
concept, it's not a task state, its a CPU state, because the reason to
disable preemption is that the task needs to look at stuff specific to
the CPU. There's a reason that we don't want the task to migrate, and
that reason has to do with data specific to the CPU not the task.

>
> >
> > If the state never leaves a CPU with a task, then it's a CPU state.
>
> Per CPU and per task property aren't mutually exclusive.

No, but using my simple rules (and that's why I made them simple). They
can be. :-)

> Per task obviously implies per cpu. And per cpu doesn't imply per task.

I disagree. A state of a task can move with the task when it moves to
another CPU. That means the task state does not imply per cpu state.

For example, a task may be sleeping and not on any CPU, but it still
has that "state", where no CPU has it.

And I can say per cpu implies per task, as a state of the CPU implies
that it is also the current state of the task running on that CPU.

A CPU always has one task on it (idle if nothing is running), thus any
CPU state is also some task state. But a task may not be on any CPU,
thus a task state does not imply that any CPU has that state as well.

>
> Taking that into account, when you're dealing with a per task property,
> the rest depends on the POV: you can zoom to the lower level and look at
> things from the CPU POV. Or you can zoom out and look at thing from the
> task POV.

But I just negated "that account", so this no longer applies.

>
> RCU extended quiescent states can be viewed from both POV. RCU just
> only cares about the CPU POV.

No it does not!

What is a grace period? Some task has access to some data that is about
to change, but a synchronize_rcu() has been executed, and needs to wait
for the grace period to end before it can continue. If that task
sleeps, then there's no CPU that RCU cares about. It cares *only* about
the tasks that are preventing the grace period to finish. Why else do
we implement rcu boosting?

Please, lets focus on the concept of RCU not the implementation.

>
> We don't even need a task to set/unset a CPU to/from extended quiescent state.
> It's only a matter of CPU running RCU code. The kernel just happens to use
> tasks to run code.

We are really confusing dynamic ticks and NO_HZ with RCU here. I hate
the term "extended quiescent state", when it really just means the CPU
is not using RCU anymore. Yes, that is a CPU state. But I think we are
getting things backwards. We are saying its a RCU state, when it really
is a CPU state, and this is due to our implementation and not about RCU
itself.

The CPU can not be in this state if the task does something that
requires accessing RCU critical data. Basically, there's a state that
says the task wont be doing any RCU work (going idle, or going into
userland), and thus, we have an implementation to optimize RCU not to
care about this CPU. The confusion here is that this is an optimization
implementation, and not a "concept" of RCU.

It's really a CPU and task state that say "we don't need RCU, so RCU go
do your business someplace else".

But this is implemented in the RCU system, and thus we are saying that
RCU is dictating what is going on here, when it really is the task and
CPU state that are doing the dictating.

This is why I liked "rcu_ignored()" as it really is the real state of
affairs. We are basically saying that the task, or tasks, running on
the CPU is not going to do any RCU work, so the RCU system can safely
ignore this CPU as an optimization implementation.

And this is where all POVs are getting confused :-)

>
> It's a bit the same with spinlocks. spinlocks aren't a task synchronization
> but a CPU synchronization. It's low level. Of course a task can't sleep
> with a spinlock held (not talking about -rt) so it could be defined as a per
> task property. But it's just not relevant.

Again, this is where we get into trouble. No it is not a CPU
synchronization. We only disable preemption because of implementation
details. Not the concept. A spin lock is only used to protect critical
data, and not to disable preemption. Those that use it to disable
preemption has caused us issues in -rt.

This is again the problem with confusing implementation with concepts.
-rt proved that a spin lock has nothing to do with cpu state, nor
preemption.

>
> Mutexes, OTOH, really are task synchronisation. They could be considered from
> the CPU view. That's just not relevant either.

Mutexes are just a different implementation of a lock. And yes, the
sleeping nature of a mutex makes it have a different concept, but
that's because we then can not use it for places that can not sleep.

>
>
> >
> > According to the above rules, rcu_is_cpu_idle() is a task state, and
> > really should be in task_struct, and preempt_count is a CPU state, and
> > should be a per_cpu variable.
>
> No, according to you description, preempt count is a task state.

No, I already stated that it is not a task state.

> Eventually it just happens to be both. But more relevant from the CPU POV.

No it is not.

>
> And really the fact that a state can be viewed per task doesn't mean it's
> always right to store it in the task struct. The semantics also
> play a big role here.

I'll say that implementation can play a big role. If there's a fast way
to implement something, we need to do it that way, even if one looks at
it and says "hmm, that doesn't match the concept". But those need to be
commented.

I'm not saying that we must follow the concept in our implementation,
I'm saying that names of functions and comments need to be focused on
concept, where as the implementation can focus on optimizations, and
comment where they differ.

For example, preempt_count really is a per cpu state. But because of
issues accessing it in assembly, it was added to a task state. And when
things change like this, we have to add hacks when concept and
implementation differ. For example, when interrupts came in and used a
different stack, we now can not use the stack info to access preempt
count. This hack has changed over the years, but just points out how
things go apart when concept and implementation differ.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Alexander Gordeev: "[PATCH 3/9] PCI/MSI/x86: Make return values only 0/-errno when MSIsallocated"
Previous message: Oleg Nesterov: "Re: free_pid() && PIDNS_HASH_ADDING"
In reply to: Frederic Weisbecker: "Re: [PATCH] rcu: Is it safe to enter an RCU read-side criticalsection?"
Next in thread: Steven Rostedt: "Re: [PATCH] rcu: Is it safe to enter an RCU read-side criticalsection?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]