Re: [PATCH] Fix tasks being forgotten for a long time on SMP

From: Yuriy Romanenko
Date: Thu Sep 22 2016 - 15:06:13 EST


Hello,

I just spent three hours studying TOT scheduler code, and I am leaning
towards that this patch being no longer relevant (or being never
relevant, really).
If it ever fixed a real issue, I think it fixed it incorrectly. However, I would
love to understand what exactly we were seeing and why this helped at all.

So a little background on this. I found this change I made that I failed to
upstream from a few years ago, so my memory is a little rusty. This
was definitely an issue on the setup we had, which was an arch/arm
mach-msm (now known as mach-qcom, I believe) fork of 3.4.0

> WTH his SWFI and which arch has that?

SWFI is "Stop and Wait For Interrupt". This appears to be a qcom specific term
that they stopped using, so I apologize for that. It specifically
refers to a CPU that
is sitting in some sort of a "wait" or "wfi" instruction and not
executing anything
until an interrupt occurs.

> Yeah, no, this is completely insane.

Totally possible. In fact, I sort of hope that it is.

> If the remote cpu is running the idle task, check_preempt_curr() should
> very much wake it up, if its not the idle class, it should never get
> there because there is now an actually runnable task on.

> Please explain in detail what happens and/or provide traces of this
> happening.

We observed two relevant scenarios that are similar but distinct.
Looking through
top-of-tree code more carefully I don't see either as possible
anymore, so the patch
probably does not apply, but I would still defer to your expertise

We have an SCHED_FIFO thread that falls asleep in a poll() waiting for
an interrupt from a
device driver.

1) It is the last thread on that CPU, the CPU enters idle and goes
into a 'wfi' instruction
because of cpuidle_idle_call(), and will not wake up until it receives
an interrupt of some sort (like IPI).
The interrupt occurs on a different CPU, the task is woken, but
doesn't run until much later.

2) It is not the last thread on that CPU. The CPU switches to a lower
priority task in a different class.
The interrupt occurs on a different CPU, the task is woken, but
doesn't preempt until much later.

Unfortunately I don't have the ftrace logs from back then, and it
would be hard to recreate the scenario.

If I recall correctly, the problem seemed to occur because the task
cpus_share_cache() was true, the task
was never put on the wake_list, and correspondingly the
scheduler_ipi() that was triggered by check_preempt_curr()
did nothing, when it returned it would somehow wind up in schedule()
where it did nothing because task was not TASK_RUNNING.
Does that make any amount of sense at all?

Sorry if this is a stupendous waste of time,

Regards,

Yuriy


On Thu, Sep 22, 2016 at 12:34 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Tue, Sep 20, 2016 at 06:14:34PM -0700, Yuriy Romanenko wrote:
>> From e9a304ae91fa2a4427bde7d3aea18296d0ebb27f Mon Sep 17 00:00:00 2001
>> From: Yuriy Romanenko <yromanenko@xxxxxxxxxxxxxx>
>> Date: Tue, 20 Sep 2016 17:47:28 -0700
>> Subject: [PATCH] Fix tasks being forgotten for a long time on SMP
>>
>> Observed occasional very high latency on an embedded SMP system between
>> a task becoming ready to run and actually running with low system load,
>> impacting interactive usage.
>>
>> A sched_wake() from CPUx on CPUy puts the task into the run queue and
>> marks it runnable, but does not trigger an IPI to have the scheduler
>> re-run on CPUy and see if the current task needs to get pre-empted and
>> does not wake up CPUy if it is asleep.
>>
>> This is especially evident when a CPU is in SWFI and simply does not
>> wake up even though it now has a runnable task.
>
> WTH his SWFI and which arch has that?
>
>> This is probably not the most elegant fix and definitely generates some
>> unnecessary scheduler runs, but it's better for overall latency.
>>
>> Signed-off-by: Yuriy Romanenko <yromanenko@xxxxxxxxxxxxxx>
>> ---
>> kernel/sched/core.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 860070f..7c334b7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1686,6 +1686,14 @@ static void ttwu_do_wakeup(struct rq *rq, struct
>> task_struct *p, int wake_flags,
>> trace_sched_wakeup(p);
>>
>> #ifdef CONFIG_SMP
>> + /*
>> + * If the task is not on the current cpu, there is a chance
>> + * the other cpu might be asleep and will not get to our task
>> + * for a really long time. Send an IPI to avoid that
>> + */
>> + if (task_cpu(p) != smp_processor_id())
>> + smp_send_reschedule(task_cpu(p));
>> +
>
> Yeah, no, this is completely insane.
>
> If the remote cpu is running the idle task, check_preempt_curr() should
> very much wake it up, if its not the idle class, it should never get
> there because there is now an actually runnable task on.
>
> Please explain in detail what happens and/or provide traces of this
> happening.