Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not

From: Steven Rostedt
Date: Fri Jan 21 2022 - 12:16:09 EST


On Thu, 20 Jan 2022 16:25:18 +0000
Valentin Schneider <valentin.schneider@xxxxxxx> wrote:

> Hi folks,
>
> Problem
> =======
>
> Abhijeet pointed out that the following sequence of trace events can be
> observed:
>
> cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
> grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
> <idle>-0 [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004
>
> IOW, not-yet-woken task gets presented as runnable and switched out in
> favor of the idle task... Dietmar and I had a look, turns out this sequence
> can happen:
>
> p->__state = TASK_INTERRUPTIBLE;
> __schedule()
> deactivate_task(p);
> ttwu()
> READ !p->on_rq
> p->__state=TASK_WAKING
> trace_sched_switch()
> __trace_sched_switch_state()
> task_state_index()
> return 0;
>
> TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
> This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") which punted the TASK_WAKING write to the other
> side of
>
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> in ttwu().
>
> Uwe reported on #linux-rt what I think is a similar issue with
> TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
> you get prev_state=0 despite the task blocking on a lock.
>
> Both of those are very confusing for tooling or even human observers.



This all looks fine to me:

Reviewed-by: Steven Rostedt (Google) <rostedt@xxxxxxxxxxx>

Peter, want to take this through your tree?

-- Steve