sched: race between deactivate and switch sched_info accounting?

From: Paul Turner
Date: Fri Oct 09 2009 - 22:42:32 EST



Hi,

While chasing down some performance we found inconsistenies in the
accounting of run-delay. Tracking this down I think there exists a race
condition in which a task can 're-arrive' without updating its queuing
accounting. This can lead to incorrect run_delay and rq_cpu_time
accounting.

Consider a flow such as:

1. task p blocks, enters schedule()
2. we deactivate p
3. there's nothing else so we load balance
4. during load balance we lock balance and a remote cpu succeeds in
re-activating p
5. we then re-pick next==p
6. we skip the accounting update since prev==next, start running p again
(last_arrival is left polluted with our prior arrival)

<time passes>

7. enter schedule() again, switch into new task
a. if p blocks then we'll go through deactivate which then will
charge the last period as run_delay even though we were actually
running.
b. if p doesn't block then we'll leave last_queued at the stale prior
value since last_queued != 0 in sched_info_queued.

[*] in depart our delta will cover the span since our original arrival
including some (probably largely inconsequential) extra time that
wouldn't normally be charged to us.

<time passes>

8. pick p to run again
if 7b was true then the queue delay we charge here will again include
p's prior timeslice that was spent running.

One way to work around this is to charge a switch event which accounts for
the fact that the task technically did temporarily depart and re-arrive.
A patch for this approach is below but there are definitely other ways to
handle it (there may also be additional concerns as further accounting
continues to be added).

Thanks,

- Paul

---


It's possible for our previously de-activated task to be re-activated by a
remote cpu during lock balancing. We have to account for this manually
since prev == next, yet the task just went through dequeue accounting.

Signed-off-by: Paul Turner <pjt@xxxxxxxxxx>
---
kernel/sched.c | 15 ++++++++++++---
1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ee61f45..6445d9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5381,7 +5381,7 @@ asmlinkage void __sched schedule(void)
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq *rq;
- int cpu;
+ int cpu, deactivated_prev = 0;

need_resched:
preempt_disable();
@@ -5406,8 +5406,10 @@ need_resched_nonpreemptible:
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
- else
+ else {
deactivate_task(rq, prev, 1);
+ deactivated_prev = 1;
+ }
switch_count = &prev->nvcsw;
}

@@ -5434,8 +5436,15 @@ need_resched_nonpreemptible:
*/
cpu = smp_processor_id();
rq = cpu_rq(cpu);
- } else
+ } else {
+ /*
+ * account for our previous task being re-activated by a
+ * remote cpu.
+ */
+ if (unlikely(deactivated_prev))
+ sched_info_switch(prev, prev);
spin_unlock_irq(&rq->lock);
+ }

post_schedule(rq);

--
1.5.4.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/