[PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()

From: Jan H. SchÃnherr
Date: Wed Jul 27 2011 - 15:12:08 EST


From: Jan H. SchÃnherr <schnhrr@xxxxxxxxxxxxxxx>

In the case of an on_list ancestor we may incorrectly place the child to
the right of a great-ancestor on the list.

Consider:

A
/ \ Here, t1A results in A->cfs_rq being on_list, however when
B t1A we start enqueuing from C this will not be visible. This is
/ compounded by the fact that on_list expiration may also be out
C of order, punching holes in the tree.
/
t1C

Prevent this by making additions to the leaf_cfs_rq_list position
independent.

This is done by collecting additions to this list within the
enqueue_task_fair() path until we reach the top or find an on_list
ancestor. All additions are then spliced into the leaf_cfs_rq_list at
once.

The additions cannot be collected in a list with the normal means as
this violates the RCU guarantee that the next pointer of an element
always leads back to the list as long as there could be concurrent
readers. However, we are still allowed to modify the next pointer of
deleted entries as long as we make sure that they eventually return to
the list. That is, if we have to collect more than one entry in a row,
it is legal to set the next-pointer of the first collected entry to the
next one. (We can do this even without using *_rcu functions, as there
are no new elements on that path.)

Suggested-by: Paul Turner <pjt@xxxxxxxxxx>
Signed-off-by: Jan H. SchÃnherr <schnhrr@xxxxxxxxxxxxxxx>

---
Changes since v1:
- moved some more code to include/linux/rculist.h
- added comment regarding RCU
---
kernel/sched_fair.c | 93 ++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c768588..1487649 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,26 +143,75 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
return cfs_rq->tg->cfs_rq[this_cpu];
}

-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+ struct list_head *leaf_cfs_rqs)
{
- if (!cfs_rq->on_list) {
- /*
- * Ensure we either appear before our parent (if already
- * enqueued) or force our parent to appear after us when it is
- * enqueued. The fact that we always enqueue bottom-up
- * reduces this to two cases.
- */
- if (cfs_rq->tg->parent &&
- cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
- list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
- &rq_of(cfs_rq)->leaf_cfs_rq_list);
- } else {
- list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
- &rq_of(cfs_rq)->leaf_cfs_rq_list);
- }
+ if (cfs_rq->on_list)
+ return;
+
+ /*
+ * Carefully collect leaf-cfs entries.
+ *
+ * There are basically two cases:
+ * 1. cfs_rq has not yet been on the leaf-list.
+ * 2. cfs_rq has been deleted previously from the leaf_list.
+ *
+ * In case 2, we might still have concurrent readers.
+ *
+ * Therefore, the requirements of list_add_tail_nobackref() are
+ * fulfilled:
+ *
+ * a) If there are concurrent readers, ->next must lead back to the
+ * list.
+ *
+ * We can only have readers after case 2. After case 2, only case 2
+ * can follow. The next pointers of case 2 nodes always lead back to
+ * the list.
+ *
+ * b) If there are concurrent readers, ->next must not lead to any
+ * already collected node.
+ *
+ * As we collect nodes always bottom-up, all already collected nodes
+ * must be below this node in the task group hierarchy. The
+ * ordering constraint of the leaf list guarantees that next
+ * pointers only lead to nodes further up in the hierarchy (or to
+ * unrelated nodes). Neither deleting nodes nor the manipulations
+ * done here change that. Thus, we cannot reach an already collected
+ * node.
+ *
+ * c) If there are concurrent readers, they must already know this
+ * node.
+ *
+ * If we have to add case 1 nodes, they are collected in the
+ * beginning and cannot be reached by readers until they are
+ * spliced. Furthermore, after they are spliced, we will not
+ * encounter more case 1 nodes higher up in the task group
+ * hierarchy. For this reason any reader on an earlier collected
+ * case 2 node must know all nodes that we collect later.
+ */
+ list_add_tail_nobackref(&cfs_rq->leaf_cfs_rq_list, leaf_cfs_rqs);

- cfs_rq->on_list = 1;
+ /*
+ * If our parent is on_list or if there is no parent, then splice all
+ * entries collected so far at the correct position into the
+ * leaf_cfs_rq_list.
+ *
+ * If our parent is not on the list, it will be collected during the
+ * next call to this function.
+ */
+ if (cfs_rq->tg->parent) {
+ int cpu = cpu_of(rq_of(cfs_rq));
+ struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
+ if (parent_cfs_rq->on_list) {
+ list_splice_tail_init_rcu(leaf_cfs_rqs,
+ &parent_cfs_rq->leaf_cfs_rq_list);
+ }
+ } else {
+ list_splice_tail_init_rcu(leaf_cfs_rqs,
+ &rq_of(cfs_rq)->leaf_cfs_rq_list);
}
+
+ cfs_rq->on_list = 1;
}

static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -276,7 +325,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
return &cpu_rq(this_cpu)->cfs;
}

-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+ struct list_head *leaf_cfs_rqs)
{
}

@@ -998,8 +1048,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;

- if (cfs_rq->nr_running == 1)
- list_add_leaf_cfs_rq(cfs_rq);
}

static void __clear_buddies_last(struct sched_entity *se)
@@ -1326,12 +1374,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ struct list_head leaf_cfs_rqs;
+
+ INIT_LIST_HEAD(&leaf_cfs_rqs);

for_each_sched_entity(se) {
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
+ if (cfs_rq->nr_running == 1)
+ list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs);
flags = ENQUEUE_WAKEUP;
}

--
1.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/